# **Tools and Algorithms for the Construction and Analysis of Systems**

**27th International Conference, TACAS 2021 Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2021 Luxembourg City, Luxembourg, March 27 – April 1, 2021 Proceedings, Part II**

### Lecture Notes in Computer Science 12652

Founding Editors

Gerhard Goos, Germany Juris Hartmanis, USA

#### Editorial Board Members

Elisa Bertino, USA; Wen Gao, China; Bernhard Steffen, Germany; Gerhard Woeginger, Germany; Moti Yung, USA

### Advanced Research in Computing and Software Science Subline of Lecture Notes in Computer Science

Subline Series Editors

Giorgio Ausiello, University of Rome 'La Sapienza', Italy Vladimiro Sassone, University of Southampton, UK

Subline Advisory Board

Susanne Albers, TU Munich, Germany; Benjamin C. Pierce, University of Pennsylvania, USA; Bernhard Steffen, University of Dortmund, Germany; Deng Xiaotie, Peking University, Beijing, China; Jeannette M. Wing, Microsoft Research, Redmond, WA, USA

More information about this subseries at http://www.springer.com/series/7407


Editors Jan Friso Groote Eindhoven University of Technology Eindhoven, The Netherlands

Kim Guldstrand Larsen Aalborg University Aalborg East, Denmark

ISSN 0302-9743 ISSN 1611-3349 (electronic) Lecture Notes in Computer Science ISBN 978-3-030-72012-4 ISBN 978-3-030-72013-1 (eBook) https://doi.org/10.1007/978-3-030-72013-1

LNCS Sublibrary: SL1 – Theoretical Computer Science and General Issues

© The Editor(s) (if applicable) and The Author(s) 2021. This book is an open access publication.

Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

### ETAPS Foreword

Welcome to the 24th ETAPS! ETAPS 2021 was originally planned to take place in Luxembourg in its beautiful capital Luxembourg City. Because of the Covid-19 pandemic, this was changed to an online event.

ETAPS 2021 was the 24th instance of the European Joint Conferences on Theory and Practice of Software. ETAPS is an annual federated conference established in 1998, and consists of four conferences: ESOP, FASE, FoSSaCS, and TACAS. Each conference has its own Program Committee (PC) and its own Steering Committee (SC). The conferences cover various aspects of software systems, ranging from theoretical computer science to foundations of programming languages, analysis tools, and formal approaches to software engineering. Organising these conferences in a coherent, highly synchronised conference programme enables researchers to participate in an exciting event, having the possibility to meet many colleagues working in different directions in the field, and to easily attend talks of different conferences. On the weekend before the main conference, numerous satellite workshops take place that attract many researchers from all over the globe.

ETAPS 2021 received 260 submissions in total, 115 of which were accepted, yielding an overall acceptance rate of 44.2%. I thank all the authors for their interest in ETAPS, all the reviewers for their reviewing efforts, the PC members for their contributions, and in particular the PC (co-)chairs for their hard work in running this entire intensive process. Last but not least, my congratulations to all authors of the accepted papers!

ETAPS 2021 featured the unifying invited speakers Scott Smolka (Stony Brook University) and Jane Hillston (University of Edinburgh) and the conference-specific invited speakers Işil Dillig (University of Texas at Austin) for ESOP and Willem Visser (Stellenbosch University) for FASE. Invited tutorials were provided by Erika Ábrahám (RWTH Aachen University) on analysis of hybrid systems and Madhusudan Parthasarathy (University of Illinois at Urbana-Champaign) on combining machine learning and formal methods.

ETAPS 2021 was originally supposed to take place in Luxembourg City, Luxembourg, organized by the SnT - Interdisciplinary Centre for Security, Reliability and Trust, University of Luxembourg. The University of Luxembourg was founded in 2003. The university is one of the best and most international young universities with 6,700 students from 129 countries and 1,331 academics from all over the globe. The local organisation team consisted of Peter Y.A. Ryan (general chair), Peter B. Roenne (organisation chair), Joaquin Garcia-Alfaro (workshop chair), Magali Martin (event manager), David Mestel (publicity chair), and Alfredo Rial (local proceedings chair).

ETAPS 2021 was further supported by the following associations and societies: ETAPS e.V., EATCS (European Association for Theoretical Computer Science), EAPLS (European Association for Programming Languages and Systems), and EASST (European Association of Software Science and Technology).

The ETAPS Steering Committee consists of an Executive Board, and representatives of the individual ETAPS conferences, as well as representatives of EATCS, EAPLS, and EASST. The Executive Board consists of Holger Hermanns (Saarbrücken), Marieke Huisman (Twente, chair), Jan Kofron (Prague), Barbara König (Duisburg), Gerald Lüttgen (Bamberg), Caterina Urban (INRIA), Tarmo Uustalu (Reykjavik and Tallinn), and Lenore Zuck (Chicago).

Other members of the steering committee are: Patricia Bouyer (Paris), Einar Broch Johnsen (Oslo), Dana Fisman (Be'er Sheva), Jan Friso Groote (Eindhoven), Esther Guerra (Madrid), Reiko Heckel (Leicester), Joost-Pieter Katoen (Aachen and Twente), Stefan Kiefer (Oxford), Fabrice Kordon (Paris), Jan Křetínský (Munich), Kim G. Larsen (Aalborg), Tiziana Margaria (Limerick), Andrew M. Pitts (Cambridge), Grigore Roșu (Illinois), Peter Ryan (Luxembourg), Don Sannella (Edinburgh), Lutz Schröder (Erlangen), Ilya Sergey (Singapore), Mariëlle Stoelinga (Twente), Gabriele Taentzer (Marburg), Christine Tasson (Paris), Peter Thiemann (Freiburg), Jan Vitek (Prague), Anton Wijs (Eindhoven), Manuel Wimmer (Linz), and Nobuko Yoshida (London).

I'd like to take this opportunity to thank all the authors, attendees, organizers of the satellite workshops, and Springer-Verlag GmbH for their support. I hope you all enjoyed ETAPS 2021.

Finally, a big thanks to Peter, Peter, Magali and their local organisation team for all their enormous efforts to make ETAPS a fantastic online event. I hope there will be a next opportunity to host ETAPS in Luxembourg.

February 2021 Marieke Huisman ETAPS SC Chair ETAPS e.V. President

### Preface

TACAS 2021 was the 27th edition of the International Conference on Tools and Algorithms for the Construction and Analysis of Systems conference series. TACAS 2021 was part of the 24th European Joint Conferences on Theory and Practice of Software (ETAPS 2021), which, although originally planned to take place in Luxembourg City, was held as an online event from March 27 to April 1 due to the COVID-19 pandemic.

TACAS is a forum for researchers, developers, and users interested in rigorously based tools and algorithms for the construction and analysis of systems. The conference aims to bridge the gaps between different communities with this common interest and to support them in their quest to improve the utility, reliability, flexibility, and efficiency of tools and algorithms for building computer-controlled systems. There were four types of submissions for TACAS: research papers, regular tool papers, tool demo papers, and case study papers.


This year 141 papers were submitted to TACAS, consisting of 90 research papers, 29 regular tool papers, 16 tool demo papers, and 6 case study papers. Authors were allowed to submit up to four papers. Each paper was reviewed by three Program Committee (PC) members, who made extensive use of subreviewers.

Similarly to previous years, it was possible to submit an artifact alongside a paper, which was mandatory for regular tool and tool demo papers. An artifact might consist of a tool, models, proofs, or other data required for validation of the results of the paper. The Artifact Evaluation Committee (AEC) was tasked with reviewing the artifacts, based on their documentation, ease of use, and, most importantly, whether the results presented in the corresponding paper could be accurately reproduced. Most of the evaluation was carried out using a standardised virtual machine to ensure consistency of the results, except for those artifacts that had special hardware requirements.

The evaluation consisted of two rounds. The first round was carried out in parallel with the work of the PC. The judgment of the AEC was communicated to the PC and weighed in their discussion. The second round took place after paper acceptance notifications were sent out; authors of accepted research papers who did not submit an artifact in the first round could submit their artifact here. In total, 72 artifacts were submitted (63 in the first round and 9 in the second), of which 57 were accepted and 15 rejected. This corresponds to an acceptance rate of 79 percent. Papers with an accepted artifact include a badge on the first page.

Selected authors were requested to provide a rebuttal for both papers and artifacts in case a review gave rise to questions. In total 166 rebuttals were provided. Using the review reports and rebuttals the Programme and the Artifact Evaluation Committees extensively discussed the papers and artifacts and ultimately decided to accept 32 research papers, 7 tool papers, 6 tool demos, and 2 case studies.

Besides the regular conference papers, this two-volume proceedings also contains 8 short papers that describe the participating verification systems and a competition report presenting the results of the 10th SV-COMP, the competition on automatic software verifiers for C and Java programs. These papers were reviewed by a separate program committee (PC); each of the papers was assessed by at least three reviewers. A total of 30 verification systems with developers from 11 countries entered the systematic comparative evaluation, including four submissions from industry. Two sessions in the TACAS program were reserved for the presentation of the results: (1) a summary by the competition chair and of the participating tools by the developer teams in the first session, and (2) an open community meeting in the second session.

March/April 2021 Jan Friso Groote Kim Guldstrand Larsen Frédéric Lang Thierry Lecomte Thomas Neele Peter Gjøl Jensen Dirk Beyer Alfredo Rial

### Organization

#### Program Committee (TACAS)

| Member | Affiliation |
| --- | --- |
| Christel Baier | TU Dresden, Germany |
| Dirk Beyer | LMU Munich, Germany |
| Armin Biere | Johannes Kepler University Linz, Austria |
| Valentina Castiglioni | Reykjavik University, Iceland |
| Alessandro Cimatti | Fondazione Bruno Kessler, Italy |
| Rance Cleaveland | University of Maryland, USA |
| Pedro R. D'Argenio | Universidad Nacional de Córdoba - CONICET, Argentina |
| Yuxin Deng | East China Normal University, China |
| Carla Ferreira | Universidade NOVA de Lisboa, Portugal |
| Goran Frehse | ENSTA Paris, France |
| Susanne Graf | Université Grenoble Alpes/CNRS/VERIMAG, France |
| Jan Friso Groote (Chair) | Eindhoven University of Technology, Netherlands |
| Orna Grumberg | Technion - Israel Institute of Technology, Israel |
| Kim Guldstrand Larsen (Chair) | Aalborg University, Denmark |
| Klaus Havelund | Jet Propulsion Laboratory, USA |
| Holger Hermanns | Saarland University, Germany |
| Peter Höfner | Australian National University, Australia |
| Hossein Hojjat | Rochester Institute of Technology, USA |
| Falk Howar | TU Dortmund, Germany |
| David N. Jansen | Institute of Software, Chinese Academy of Sciences, China |
| Marcin Jurdziński | The University of Warwick, Great Britain |
| Joost-Pieter Katoen | RWTH Aachen/Universiteit Twente, Germany/Netherlands |
| Jeroen J. A. Keiren | Eindhoven University of Technology, Netherlands |
| Sophia Knight | University of Minnesota, USA |
| Laura Kovács | Vienna University of Technology, Austria |
| Jan Křetínský | Technical University of Munich, Germany |
| Alfons Laarman | Leiden University, Netherlands |
| Frédéric Lang | Inria Grenoble - Rhône-Alpes/CONVECS, France |
| Thierry Lecomte | ClearSy Systems Engineering, France |
| Xinxin Liu | Institute of Software, Chinese Academy of Sciences, China |
| Mieke Massink | CNR-ISTI, Italy |
| Radu Mateescu | Inria, France |
| Jun Pang | University of Luxembourg, Luxembourg |


#### Artifact Evaluation Committee – AEC

| Member | Affiliation |
| --- | --- |
| Elvio Gilberto Amparore | University of Turin, Italy |
| Haniel Barbosa | Universidade Federal de Minas Gerais, Brazil |
| František Blahoudek | University of Texas at Austin, USA |
| Olav Bunte | Eindhoven University of Technology, Netherlands |
| Damien Busatto-Gaston | Université Libre de Bruxelles, Belgium |
| Nathalie Cauchi | University of Oxford, Great Britain |
| Jesús Mauricio Chimento | KTH, Sweden |
| Joshua Dawes | University of Luxembourg, Luxembourg |
| Mathias Fleury | Johannes Kepler University Linz, Austria |
| Daniel J. Fremont | University of California, Santa Cruz, USA |
| Manuel Gieseking | University of Oldenburg, Germany |
| Peter Gjøl Jensen (Chair) | Aalborg University, Denmark |
| Kush Grover | Technical University of Munich, Germany |
| Hans-Dieter Hiep | CWI, Netherlands |
| Daniela Kaufmann | Johannes Kepler University Linz, Austria |
| Mitja Kulczynski | Kiel University, Germany |
| Alfons Laarman | Leiden University, Netherlands |
| Luca Laurenti | University of Oxford, Great Britain |
| Maurice Laveaux | Eindhoven University of Technology, Netherlands |
| Yong Li | Institute of Software, Chinese Academy of Sciences, China |
| Debasmita Lohar | Max Planck Institute for Software Systems, Germany |
| Viktor Malík | Brno University of Technology, Czech Republic |
| Joshua Moerman | RWTH Aachen University, Germany |
| Stefanie Mohr | Technische Universität München, Germany |
| Marco Muñiz | Aalborg University, Denmark |
| Thomas Neele (Chair) | Royal Holloway University of London, Great Britain |
| Wytse Oortwijn | University of Twente, Netherlands |
| Elizabeth Polgreen | University of Edinburgh, Great Britain |
| José Proenca | CISTER-ISEP and HASLab-INESC TEC, Portugal |
| Etienne Renault | LRDE, France |
| Alceste Scalas | Technical University of Denmark, Denmark |
| Morten Konggaard Schou | Aalborg University, Denmark |
| Veronika Šoková | Brno University of Technology, Czech Republic |
| Yoni Zohar | Stanford University, USA |

#### Program Committee and Jury – SV-COMP


#### Steering Committee

Dirk Beyer LMU Munich, Germany Rance Cleaveland University of Maryland, USA Holger Hermanns Saarland University, Germany


#### Additional Reviewers

Abate, Carmine Achilleos, Antonis Akshay, S. Andriushchenko, Roman André, Étienne Asadi, Sepideh Ashok, Pranav Azeem, Muqsit Bannister, Callum Barnett, Lee Basile, Davide Batz, Kevin Baumgartner, Peter Becchi, Anna ter Beek, Maurice H. Bendík, Jaroslav Bensalem, Saddek van der Berg, Freark Berg, Jeremias Berger, Philipp Bernardo, Marco Biewer, Sebastian Bischopink, Christopher Blicha, Martin Bønneland, Frederik M. Bouvier, Pierre Bozzano, Marco Brellmann, David Broccia, Giovanna Budde, Carlos E. Bursuc, Sergiu Cassel, Sofia Castro, Pablo Chalupa, Marek Chen, Mingshuai Chiang, James Ciancia, Vincenzo Ciesielski, Maciej

Clement, Bradley Coenen, Norine Cubuktepe, Murat Degiovanni, Renzo Demasi, Ramiro Dierl, Simon Dixon, Alex van Dijk, Tom Donatelli, Susanna Dongol, Brijesh Edera, Alejandro Eisentraut, Julia Emmi, Michael Evangelidis, Alexandros Fedotov, Alexander Fedyukovich, Grigory Fehnker, Ansgar Feng, Weizhi Ferreira, Francisco Fleury, Mathias Freiberger, Felix Frenkel, Hadar Friedberger, Karlheinz Fränzle, Martin Funke, Florian Gallá, Francesco Garavel, Hubert Geatti, Luca Gengelbach, Arve Goodloe, Alwyn Goorden, Martijn Goudsmid, Ohad Griggio, Alberto Groce, Alex Grover, Kush Hafidi, Yousra Hallé, Sylvain Hecking-Harbusch, Jesko Heizmann, Matthias Holzner, Stephan Holík, Lukáš Hyvärinen, Antti Irfan, Ahmed Javed, Omar Jensen, Mathias Claus Jonas, Martin Junges, Sebastian Käfer, Nikolai Kanav, Sudeep Kapus, Timotej Kauffman, Sean Khamespanah, Ehsan Kheireddine, Anissa Kiviriga, Andrej Klauck, Michaela Kobayashi, Naoki Köhl, Maximilian Alexander Kozachinskiy, Alexander Kutsia, Temur Lahkim Bennani, Ismail Lammich, Peter Lang, Frédéric Lanotte, Ruggero Latella, Diego Laurenti, Luca Ledent, Philippe Lehtinen, Karoliina Lemberger, Thomas Li, Jianlin Li, Qin Li, Xie Li, Xin Lin, Shaokai Lion, Benjamin Liu, Jiaxiang Liu, Wanwei Loreti, Michele Magnago, Enrico Major, Juraj Marché, Claude Mariegaard, Anders Marsso, Lina Mauritz, Malte McClurg, Jedidiah

Meggendorfer, Tobias Metzger, Niklas Meyer, Roland Micheli, Andrea Mittelmann, Munyque Mizera, Andrzej Moerman, Joshua Mohr, Stefanie Mora, Federico Mover, Sergio Mues, Malte Muller, Lucie Muroor-Nadumane, Ajay Möhle, Sibylle Neele, Thomas Noll, Thomas Norman, Gethin Otoni, Rodrigo Parys, Paweł Pattinson, Dirk Pavela, Jiří Pena, Lucas Pinault, Laureline Piribauer, Jakob Pirogov, Anton Pommellet, Adrien Quatmann, Tim Rappoport, Omer Raskin, Jean-François Rothenberg, Bat-Chen Rouquette, Nicolas Rümmer, Philipp S., Krishna Šafránek, David Sankaranarayanan, Sriram Schallau, Till Schupp, Stefan Serwe, Wendelin Shafiei, Nastaran Shi, Xiaomu Síč, Juraj Sickert, Salomon Singh, Gagandeep Slivovsky, Friedrich Sølvsten, Steffan Song, Fu

Spel, Jip Srivathsan, B. Stankovic, Miroslav Stock, Gregory Strejček, Jan Su, Cui Suda, Martin Sun, Jun Svozil, Alexander Tian, Chun Tibo, Alessandro Tini, Simone Tonetta, Stefano Trtík, Marek Turrini, Andrea

Vandin, Andrea Weber, Tjark Weininger, Maximilian Wendler, Philipp Wolf, Karsten Wolovick, Nicolás Wu, Zhilin Xu, Ming Yang, Pengfei Yang, Xiaoxiao Zhan, Naijun Zhang, Min Zhang, Wenbo Zhang, Wenhui Zhao, Hengjun


## **Verification Techniques (not SMT)**

#### **Directed Reachability for Infinite-State Systems**<sup>⋆</sup>

Michael Blondin<sup>1</sup>, Christoph Haase<sup>2</sup>, and Philip Offtermatt<sup>1,3</sup>

<sup>1</sup> Université de Sherbrooke, Sherbrooke, Canada {michael.blondin, philip.offtermatt}@usherbrooke.ca <sup>2</sup> University of Oxford, Oxford, United Kingdom christoph.haase@cs.ox.ac.uk <sup>3</sup> Max Planck Institute for Software Systems, Saarbrücken, Germany

**Abstract.** Numerous tasks in program analysis and synthesis reduce to deciding reachability in possibly infinite graphs such as those induced by Petri nets. However, the Petri net reachability problem has recently been shown to require non-elementary time, which raises questions about the practical applicability of Petri nets as target models. In this paper, we introduce a novel approach for efficiently semi-deciding the reachability problem for Petri nets in practice. Our key insight is that computationally lightweight over-approximations of Petri nets can be used as distance oracles in classical graph exploration algorithms such as A<sup>∗</sup> and greedy best-first search. We provide and evaluate a prototype implementation of our approach that outperforms existing state-of-the-art tools, sometimes by orders of magnitude, and which is also competitive with domain-specific tools on benchmarks coming from program synthesis and concurrent program analysis.

**Keywords:** Petri nets · reachability · shortest paths · model checking

#### **1 Introduction**

Many problems in program analysis, synthesis and verification reduce to deciding reachability of a vertex or a set of vertices in infinite graphs, e.g., when reasoning about concurrent programs with an unbounded number of threads, or when arbitrarily many components can be used in a synthesis task. For automated reasoning tasks, those infinite graphs are finitely represented by some mathematical model. Finding the right such model requires a trade-off between the two conflicting goals of maximal expressive power and computational feasibility of the relevant decision problems. Petri nets are a ubiquitous mathematical model that provides a good compromise between those two goals. They are expressive enough to find a plethora of applications in computer science, in particular in the analysis of concurrent processes, yet the reachability problem for Petri nets is decidable [47,40,41,43]. Counter abstraction has evolved as a generic abstraction paradigm that reduces a variety of program analysis tasks to problems in Petri nets or variants thereof such as well-structured transition systems, see e.g. [30,39,61,5]. Due to their generality and versatility, Petri nets and their extensions find numerous applications also in other areas, including the design and analysis of protocols [22], business processes [57], biological systems [33,11] and chemical systems [2]. The goal of this paper is to introduce and evaluate an efficient generic approach to deciding the Petri net reachability problem on instances arising from applications in program verification and synthesis.

<sup>⋆</sup> An extended version containing full proofs as well as a primer on applications of the Petri net reachability problem can be obtained from: arxiv.org/abs/2010.07912. This work is part of a project that has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (Grant agreement No. 852769, ARiAT). It is also supported by a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada (NSERC). Parts of this research were carried out while the second author was affiliated with the Department of Computer Science, University College London, UK.

J. F. Groote and K. G. Larsen (Eds.): TACAS 2021, LNCS 12652, pp. 3–23, 2021. https://doi.org/10.1007/978-3-030-72013-1_1

A Petri net comprises a finite set of places with a finite number of transitions. Places carry a finite yet unbounded number of tokens and transitions can remove and add tokens to places. A marking specifies how many tokens each place carries. An example of a Petri net is given on the left-hand side of Figure 1, where the two places $\{p_1, p_2\}$ are depicted as circles and transitions $\{t_1, t_2, t_3\}$ as squares. Places carry tokens depicted as filled circles; thus $p_1$ carries one token and $p_2$ carries none. We write this as $[p_1 : 1, p_2 : 0]$, or $(1, 0)$ if there is a clear ordering on the places. Transition $t_1$ can add a single token to place $p_1$ at any moment. As soon as a token is present in $p_1$, it can be consumed by transition $t_2$, which then adds a token to place $p_2$ and puts back one token to place $p_1$. Finally, transition $t_3$ consumes tokens from $p_1$ without adding any token at all.
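The firing rule described above can be sketched in a few lines of Python. This is only an illustration of the example net: the dictionary encoding and the names `NET`, `enabled` and `fire` are assumptions made here, not notation from the paper.

```python
from typing import Dict

Marking = Dict[str, int]

# Hypothetical encoding of the example net: each transition maps to
# (tokens it consumes per place, tokens it produces per place).
NET = {
    "t1": ({}, {"p1": 1}),                    # adds a token to p1 at any moment
    "t2": ({"p1": 1}, {"p1": 1, "p2": 1}),    # consumes from p1, refills p1, adds to p2
    "t3": ({"p1": 1}, {}),                    # consumes from p1, produces nothing
}


def enabled(marking: Marking, transition: str) -> bool:
    """A transition is enabled if every place holds the tokens it consumes."""
    consumed, _ = NET[transition]
    return all(marking.get(place, 0) >= n for place, n in consumed.items())


def fire(marking: Marking, transition: str) -> Marking:
    """Return the marking obtained by firing an enabled transition."""
    if not enabled(marking, transition):
        raise ValueError(f"{transition} is not enabled in {marking}")
    consumed, produced = NET[transition]
    result = dict(marking)
    for place, n in consumed.items():
        result[place] = result.get(place, 0) - n
    for place, n in produced.items():
        result[place] = result.get(place, 0) + n
    return result


initial = {"p1": 1, "p2": 0}        # the marking [p1 : 1, p2 : 0]
after_t2 = fire(initial, "t2")      # p1 keeps its token, p2 gains one
after_t3 = fire(initial, "t3")      # p1 loses its token
```

Markings are plain dictionaries here for readability; a tuple encoding is more convenient once markings need to be stored in sets or priority queues.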

**Fig. 1.** Left: A Petri net $\mathcal{N}$. Right: Search of the forthcoming Algorithm 1 over the graph $G_{\mathbb{N}}(\mathcal{N})$ from $(0, 0)$ to $(0, 1)$, where $(x, y)$ denotes $[p_1 : x, p_2 : y]$ and each number in a box next to a marking is its heuristic value. Only the blue region is expanded.

A Petri net induces a possibly infinite directed graph whose vertices are markings, and whose edges are determined by the transitions of the Petri net, cf. the right side of Figure 1. Given two markings, the reachability problem asks whether they are connected in this graph. In Figure 1, the marking $(0, 1)$ is reachable from $(0, 0)$, e.g., via paths of lengths 3 and 5: $(0, 0) \xrightarrow{t_1} (1, 0) \xrightarrow{t_2} (1, 1) \xrightarrow{t_3} (0, 1)$ and $(0, 0) \xrightarrow{t_1} (1, 0) \xrightarrow{t_1} (2, 0) \xrightarrow{t_2} (2, 1) \xrightarrow{t_3} (1, 1) \xrightarrow{t_3} (0, 1)$.
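To make the induced graph concrete, the following sketch replays the two paths above and runs a depth-bounded breadth-first search over markings. The tuple encoding, the names `TRANSITIONS`, `step`, `replay` and `bfs`, and the depth bound are assumptions of this illustration; the bound is needed only because the graph is infinite.

```python
from collections import deque

# Hypothetical encoding on markings (p1, p2):
# transition name -> (tokens required, effect on the marking).
TRANSITIONS = {
    "t1": ((0, 0), (1, 0)),
    "t2": ((1, 0), (0, 1)),
    "t3": ((1, 0), (-1, 0)),
}


def step(marking, name):
    """Fire one transition, or return None if it is not enabled."""
    required, effect = TRANSITIONS[name]
    if any(have < need for have, need in zip(marking, required)):
        return None
    return tuple(have + delta for have, delta in zip(marking, effect))


def replay(marking, names):
    """Follow a sequence of transitions; fail if one is not enabled."""
    for name in names:
        marking = step(marking, name)
        if marking is None:
            raise ValueError(f"{name} is not enabled")
    return marking


def bfs(source, target, max_depth):
    """Shortest transition sequence from source to target, up to max_depth."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        marking, path = queue.popleft()
        if marking == target:
            return path
        if len(path) == max_depth:
            continue
        for name in TRANSITIONS:
            successor = step(marking, name)
            if successor is not None and successor not in seen:
                seen.add(successor)
                queue.append((successor, path + [name]))
    return None


short_path = bfs((0, 0), (0, 1), max_depth=5)
```

On this net the search returns the length-3 path, but only after visiting markings such as $(3, 0)$ and $(1, 2)$ that lie on no shortest path.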

In practice, the Petri net reachability problem is a challenging decision problem due to its horrendous worst-case complexity: an exponential-space lower bound was established in the 1970s [45], and a non-elementary time lower bound has only recently been established [13]. One may thus question whether a problem with such high worst-case complexity is of any practical relevance, and whether reducing program analysis tasks to Petri net reachability is anything more than an intellectual exercise. We debunk those concerns and present a technique which decides most reachability instances appearing in the wild. When evaluated on large-scale instances involving Petri nets with thousands of places and tens of thousands of transitions, our prototype implementation is faster most of the time, sometimes by several orders of magnitude, and solves more instances than existing state-of-the-art tools. Our implementation is also competitive with specialized domain-specific tools. One of the biggest advantages of our approach is that it is extremely simple to describe and implement, and it readily generalizes to many extensions of Petri nets. In fact, it was surprising to us that our approach has not yet been discovered. We now describe the main observations and techniques underlying our approach.

Ever since the early days of research in Petri nets, state-space over-approximations have been studied to attenuate the high computational complexity of their decision problems. One such over-approximation is, informally speaking, to allow places to carry a negative number of tokens. Deciding reachability then reduces to solving the so-called state equation, a system of linear equations associated with a Petri net. Another over-approximation is given by continuous Petri nets, a variant where places carry fractional tokens and "fractions of transitions" can be applied [14]. The benefit is that deciding reachability drops down to polynomial time [25]. While those approximations have been applied for pruning search spaces, see e.g. [23,4,8,29], we make the following simple key observation:

If a marking *m* is reachable from an initial marking in an over-approximation, then the length of a shortest witnessing path in the over-approximation lower bounds the length of a shortest path reaching *m*.
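As a worked illustration of this observation, consider the net described for Figure 1 and the over-approximation that allows negative token counts. The matrix $\Delta$ and the vector $\boldsymbol{x}$ below are notation assumed for this example only: the rows of $\Delta$ correspond to $(p_1, p_2)$, its columns to $(t_1, t_2, t_3)$, and $\boldsymbol{x}$ counts how often each transition is used.

$$
\boldsymbol{m}' = \boldsymbol{m} + \Delta \boldsymbol{x},
\qquad
\Delta = \begin{pmatrix} 1 & 0 & -1 \\ 0 & 1 & 0 \end{pmatrix},
\qquad
\boldsymbol{x} \in \mathbb{N}^{\{t_1, t_2, t_3\}}.
$$

For $\boldsymbol{m} = (0, 0)$ and $\boldsymbol{m}' = (0, 1)$ this yields the constraints

$$
x_1 - x_3 = 0, \qquad x_2 = 1 ,
$$

whose smallest solution is $\boldsymbol{x} = (0, 1, 0)$, of total size $x_1 + x_2 + x_3 = 1$. It corresponds to using $t_2$ once even though $t_2$ is not enabled in $(0, 0)$. The value 1 is a lower bound on the true distance, which is 3 since the shortest path listed for Figure 1 has length 3.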

The availability of an oracle providing lower bounds on the length of shortest paths between markings enables us to appeal to classical graph traversal algorithms which have been highly successful in artificial intelligence and require such oracles, namely A<sup>∗</sup> and greedy best-first search, see e.g. [52]. In particular, determining the length of shortest paths in the over-approximations described above can be phrased as optimization problems in (integer) linear programming and optimization modulo theories, for which efficient off-the-shelf solvers are available [32,7]. Thus, oracle calls can be made at comparatively modest computational cost, which is crucial for the applicability of those algorithms. As a result, a large class of existing state-space over-approximations can be applied to obtain a highly efficient forward-analysis semi-decision procedure for the reachability problem. For example, in Figure 1, using the state equation as distance oracle, A<sup>∗</sup> only explores the four vertices in the blue region and directly reaches the target vertex, whereas a breadth-first search may need to explore all vertices of the figure and a depth-first search may not even terminate.
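The following Python sketch illustrates the idea on the net described for Figure 1. It is not the paper's implementation: the state-equation oracle is replaced by a brute-force enumeration (standing in for an ILP solver), and `TRANSITIONS`, `state_equation_bound`, `a_star` and the cut-off `max_total` are names assumed for this example.

```python
import heapq
from itertools import combinations_with_replacement

# Hypothetical encoding on markings (p1, p2): tokens required and effect.
TRANSITIONS = {
    "t1": {"pre": (0, 0), "effect": (1, 0)},
    "t2": {"pre": (1, 0), "effect": (0, 1)},
    "t3": {"pre": (1, 0), "effect": (-1, 0)},
}


def successors(marking):
    """Markings reachable by firing one enabled transition."""
    for t in TRANSITIONS.values():
        if all(have >= need for have, need in zip(marking, t["pre"])):
            yield tuple(have + d for have, d in zip(marking, t["effect"]))


def state_equation_bound(marking, target, max_total=8):
    """Fewest transition uses turning marking into target when enabledness
    is ignored. Brute force over multisets of transitions; an assumption of
    this sketch, where a real tool would call an ILP solver."""
    effects = [t["effect"] for t in TRANSITIONS.values()]
    for total in range(max_total + 1):
        for combo in combinations_with_replacement(effects, total):
            reached = tuple(
                have + sum(e[i] for e in combo) for i, have in enumerate(marking)
            )
            if reached == target:
                return total
    # No solution of size <= max_total, so max_total + 1 is still a lower bound.
    return max_total + 1


def a_star(source, target):
    """A* over the reachability graph with unit transition costs.
    Returns (distance, list of expanded markings). May not terminate if the
    target is unreachable, since the graph can be infinite."""
    best = {source: 0}
    heap = [(state_equation_bound(source, target), 0, source)]
    expanded = []
    while heap:
        _, cost, marking = heapq.heappop(heap)
        if cost > best[marking]:
            continue  # stale queue entry
        expanded.append(marking)
        if marking == target:
            return cost, expanded
        for nxt in successors(marking):
            if cost + 1 < best.get(nxt, float("inf")):
                best[nxt] = cost + 1
                estimate = cost + 1 + state_equation_bound(nxt, target)
                heapq.heappush(heap, (estimate, cost + 1, nxt))
    return None, expanded


distance, expanded = a_star((0, 0), (0, 1))
```

On this instance the search expands exactly the markings $(0,0)$, $(1,0)$, $(1,1)$ and $(0,1)$, in line with the four vertices mentioned above. Markings such as $(2, 0)$ are generated but never expanded because their estimate exceeds the true distance.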

In theory, our approach could be turned into a decision procedure by applying bounds on the length of shortest paths in Petri nets [44]. However, such lengths can grow non-elementarily in the number of places [13], and just computing the cut-off length will already be infeasible for any Petri net of practical relevance. It is worth mentioning, though, that in practice the over-approximations we employ have often been observed to witness non-reachability as well, see e.g. [23]. Moreover, when dealing with finite state spaces, our procedure is complete.

A noteworthy benefit of our approach is that it enables finding shortest paths when A<sup>∗</sup> is used as the underlying algorithm. In program analysis, paths usually correspond to traces reaching an erroneous configuration. In this setting, shorter error traces are preferred as they help in understanding why a certain error occurs. Furthermore, in program synthesis, paths correspond to synthesis plans. Again, shorter paths are preferred as they yield shorter synthesized programs. In fact, we develop our algorithmic framework for weighted Petri nets in which transitions are weighted with positive integers. Classical Petri nets correspond to the special instance where all weights are equal to one. Weighted Petri nets are useful to reflect cost or preferences in synthesis tasks. For example, there are program synthesis approaches where software projects are mined to determine how often API methods are called to guide a procedure by preferring more frequent methods [27,26,46]. Similarity metrics can also be used to obtain costs estimating the relevance of invoking methods [24]. It has further been argued that weighted Petri nets are a good model for synthesis tasks of chemical reactions as they can reflect costs of various chemical compounds [58]. Finally, weights can be viewed as representing an amount of time it takes to fire a transition, see e.g. [50].

Related work. Our approach falls under the umbrella term directed model checking coined in the early 2000s, which refers to a set of techniques to tackle the state-explosion problem via guided state-space exploration. It primarily targets disproving safety properties by quickly finding a path to an error state without the need to explicitly construct the whole state space. As such, directed model checking is useful for bug-finding since, in the words of Yang and Dill [60], "in practice, model checkers are most useful when they find bugs, not when they prove a property". The survey paper [20] gives an overview of various directed model checking techniques for finite-state systems.

For Petri nets, directed reachability algorithms based on over-approximations as developed in this work have not been described. In [56], it is argued that exploration heuristics, like A<sup>∗</sup>, can be useful for Petri nets, but they do not consider over-approximations for the underlying heuristic functions. The authors of [36] use Petri nets for scheduling problems and employ the state equation, viewed as a system of linear equations over $\mathbb{Q}$, in order to explore and prune reachability graphs. This approach is, however, not guaranteed to discover shortest paths. There has been further work on using A<sup>∗</sup> for exploring the reachability graph of Petri nets for scheduling problems, see, e.g., [42,48] and the references therein.

#### **2 Preliminaries**

Let $\mathbb{N} := \{0, 1, \ldots\}$. For all $D \subseteq \mathbb{Q}$ and ${\triangleright} \in \{\geq, >\}$, let $D_{\triangleright 0} := \{a \in D : a \triangleright 0\}$, and for every set $X$, let $D^X$ denote the set of vectors $D^X := \{\mathbf{v} \mid \mathbf{v} \colon X \to D\}$. We naturally extend operations componentwise. In particular, $(\mathbf{u} + \mathbf{v})(x) := \mathbf{u}(x) + \mathbf{v}(x)$ for every $x \in X$, and $\mathbf{u} \geq \mathbf{v}$ iff $\mathbf{u}(x) \geq \mathbf{v}(x)$ for every $x \in X$.

Graphs. A (labeled directed) graph is a triple $G = (V, E, A)$, where $V$ is a set of nodes, $A$ is a finite set of elements called actions, and $E \subseteq V \times A \times V$ is the set of edges labeled by actions. We say that $G$ has finite out-degree if the set of outgoing edges $\{(w, a, w') \in E : w = v\}$ is finite for every $v \in V$. Similarly, it has finite in-degree if the set of ingoing edges is finite for every $v \in V$. If $G$ has both finite out- and in-degree, then we say that $G$ is locally finite. A path $\pi$ is a finite sequence of nodes $(v_i)_{1 \leq i \leq n}$ and actions $(a_i)_{1 \leq i < n}$ such that $(v_i, a_i, v_{i+1}) \in E$ for all $1 \leq i < n$. We say that $\pi$ is a path from $v$ to $w$ (or a $v$-$w$ path) if $v = v_1$ and $w = v_n$, and its label is $a_1 a_2 \cdots a_{n-1}$, where $\varepsilon$ denotes the empty sequence.

A weighted graph is a tuple $G = (V, E, A, \mu)$ where $(V, E, A)$ is a graph with a weight function $\mu \colon E \to \mathbb{Q}_{>0}$. The weight of a path $\pi$ is the sum of the weights of its edges, i.e. $\mu(\pi) := \sum_{1 \leq i < n} \mu(v_i, a_i, v_{i+1})$. A shortest path from $v$ to $w$ is a $v$-$w$ path $\pi$ minimizing $\mu(\pi)$. We define $\mathrm{dist}_G \colon V \times V \to \mathbb{Q}_{\geq 0} \cup \{\infty\}$ as the distance function where $\mathrm{dist}_G(v, w)$ is the weight of a shortest path from $v$ to $w$, with $\mathrm{dist}_G(v, w) := \infty$ if there is none. We assume throughout the paper that weighted graphs have a minimal weight, i.e. that $\min\{\mu(e) : e \in E\}$ exists. For graphs with finite out-degree, this ensures that if a path exists between two nodes, then a shortest one exists.<sup>4</sup> This mild assumption always holds in our setting.

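Petri nets. A weighted Petri net is a tuple $\mathcal{N} = (P, T, f, \lambda)$ where

**–** $P$ is a finite set whose elements are called places,

**–** $T$ is a finite set, disjoint from $P$, whose elements are called transitions,

**–** $f \colon (P \times T) \cup (T \times P) \to \mathbb{N}$ is the flow function assigning multiplicities to arcs, and

**–** $\lambda \colon T \to \mathbb{N}_{>0}$ is the weight function.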


A marking is a vector $m \in \mathbb{N}^P$ which indicates that each place $p$ holds $m(p)$ tokens. A weighted Petri net with $\lambda(t) = 1$ for each $t \in T$ is called a Petri net. For example, Figure 1 depicts a Petri net $\mathcal{N}$ with $P = \{p_1, p_2\}$, $T = \{t_1, t_2, t_3\}$, $f(p_1, t_3) = f(p_1, t_2) = f(t_1, p_1) = f(t_2, p_1) = f(t_2, p_2) = 1$ (multiplicity omitted on arcs) and $f(-, -) = 0$ elsewhere (no arc). Moreover, $\mathcal{N}$ is marked with $[p_1 \colon 1, p_2 \colon 0]$.

The guard and effect of a transition $t \in T$ are vectors $g_t \in \mathbb{N}^P$ and $\Delta_t \in \mathbb{Z}^P$ where $g_t(p) := f(p, t)$ and $\Delta_t(p) := f(t, p) - f(p, t)$. We say that $t$ is firable from marking $m$ if $m \geq g_t$. If $t$ is firable from $m$, then it may be fired, which leads to marking $m' := m + \Delta_t$. We write this as $m \xrightarrow{t}_{\mathcal{N}} m'$. These notions naturally extend to sequences of transitions, i.e. $\xrightarrow{\varepsilon}_{\mathcal{N}}$ denotes the identity relation over $\mathbb{N}^P$, $\Delta_\varepsilon := \mathbf{0}$, $\lambda(\varepsilon) := 0$, and for every $t_1, t_2, \ldots, t_k \in T$: $\Delta_{t_1 t_2 \cdots t_k} := \Delta_{t_1} + \Delta_{t_2} + \cdots + \Delta_{t_k}$, $\lambda(t_1 t_2 \cdots t_k) := \lambda(t_1) + \lambda(t_2) + \cdots + \lambda(t_k)$, and

$$\xrightarrow{t\_1 \, t\_2 \cdots t\_k}\_{\mathcal{N}} := \xrightarrow{t\_k}\_{\mathcal{N}} \circ \cdots \circ \xrightarrow{t\_2}\_{\mathcal{N}} \circ \xrightarrow{t\_1}\_{\mathcal{N}}.$$

<sup>4</sup> Otherwise, there could be increasingly better paths, e.g. of weights 1, 1/2, 1/4,....

We say that ${\rightarrow_{\mathcal{N}}} := \bigcup_{t \in T} \xrightarrow{t}_{\mathcal{N}}$ and ${\xrightarrow{*}_{\mathcal{N}}} := \bigcup_{\sigma \in T^*} \xrightarrow{\sigma}_{\mathcal{N}}$ are the step and reachability relations. Note that the latter is the reflexive transitive closure of $\rightarrow_{\mathcal{N}}$.

For example, $m \xrightarrow{t_2 t_3}_{\mathcal{N}} m'$ and $m \xrightarrow{t_1 t_2 t_3 t_3}_{\mathcal{N}} m'$ in Figure 1, where $m := [p_1 \colon 1, p_2 \colon 0]$ and $m' := [p_1 \colon 0, p_2 \colon 1]$. Moreover, $t_2$ is not firable in $m'$.
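The firing rule can be made concrete with a minimal Python sketch. It encodes the net of Figure 1 as read from the flow function given in the text; the names `GUARD`, `EFFECT`, `firable`, `fire` and `fire_sequence` are illustrative choices, not part of any tool described here.

```python
from typing import Dict, Sequence, Tuple

Marking = Tuple[int, ...]  # component i holds the tokens of place p_{i+1}

# Figure 1 net, derived from the flow function f stated in the text:
# guard g_t(p) = f(p, t) and effect Delta_t(p) = f(t, p) - f(p, t).
GUARD: Dict[str, Marking] = {"t1": (0, 0), "t2": (1, 0), "t3": (1, 0)}
EFFECT: Dict[str, Marking] = {"t1": (1, 0), "t2": (0, 1), "t3": (-1, 0)}


def firable(m: Marking, t: str) -> bool:
    """t is firable from m iff m >= g_t componentwise."""
    return all(tokens >= need for tokens, need in zip(m, GUARD[t]))


def fire(m: Marking, t: str) -> Marking:
    """Return m' = m + Delta_t, or raise if the guard is violated."""
    if not firable(m, t):
        raise ValueError(f"{t} is not firable from {m}")
    return tuple(tokens + delta for tokens, delta in zip(m, EFFECT[t]))


def fire_sequence(m: Marking, sigma: Sequence[str]) -> Marking:
    """Fire the transitions of sigma from left to right."""
    for t in sigma:
        m = fire(m, t)
    return m
```

Under this encoding, both firing sequences of the example lead from (1, 0) to (0, 1), and `firable((0, 1), "t2")` is false.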

Given a sequence $\sigma \in T^*$, denote by $|\sigma|_t \in \mathbb{N}$ the number of times transition $t$ occurs in $\sigma$. The Parikh image of $\sigma$ is the vector $\boldsymbol{\sigma} \in \mathbb{N}^T$ that captures the number of occurrences of transitions appearing in $\sigma$, i.e. $\boldsymbol{\sigma}(t) := |\sigma|_t$ for all $t \in T$.
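As a small illustration, the Parikh image can be computed with a counter. The sketch below again assumes the Figure 1 net; `parikh_image` and `effect_of` are hypothetical helper names. It also shows that the effect of a sequence depends only on its Parikh image.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple

TRANSITIONS = ("t1", "t2", "t3")
# Effects Delta_t of the Figure 1 net (assumed encoding, places ordered p1, p2).
EFFECT = {"t1": (1, 0), "t2": (0, 1), "t3": (-1, 0)}


def parikh_image(sigma: Sequence[str], transitions=TRANSITIONS) -> Dict[str, int]:
    """Map each transition t to |sigma|_t, its number of occurrences in sigma."""
    counts = Counter(sigma)
    return {t: counts[t] for t in transitions}


def effect_of(parikh: Dict[str, int]) -> Tuple[int, ...]:
    """Delta_sigma = sum over t of parikh(t) * Delta_t; firing order is irrelevant."""
    return tuple(sum(parikh[t] * EFFECT[t][p] for t in parikh) for p in range(2))
```

Two sequences that are permutations of each other have the same Parikh image and hence the same effect, although only some orderings may be firable.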

Each weighted Petri net $\mathcal{N} = (P, T, f, \lambda)$ induces a locally finite weighted graph $G_{\mathbb{N}}(\mathcal{N}) := (V, E, T, \mu)$, called its reachability graph, where $V := \mathbb{N}^P$, $E := \{(m, t, m') : m \xrightarrow{t}_{\mathcal{N}} m'\}$ and $\mu(m, t, m') := \lambda(t)$ for each $(m, t, m') \in E$. An example of a reachability graph is given on the right of Figure 1. We write $\mathrm{dist}_{\mathcal{N}}$ to denote $\mathrm{dist}_{G_{\mathbb{N}}(\mathcal{N})}$. We have $\mathrm{dist}_{\mathcal{N}}(m, m') \neq \infty$ iff $m \xrightarrow{\sigma}_{\mathcal{N}} m'$ for some $\sigma \in T^*$, and if the latter holds, then $\mathrm{dist}_{\mathcal{N}}(m, m')$ is the minimal weight among such firing sequences $\sigma$. Moreover, for (unweighted) Petri nets, $\mathrm{dist}_{\mathcal{N}}(m, m')$ is the minimal number of transitions to fire to reach $m'$ from $m$.

#### **3 Directed Search Algorithms**

Our approach relies on classical pathfinding procedures guided by node selection strategies. Their generic scheme is described in Algorithm 1. Its termination with a value $d \neq \infty$ indicates that the weighted graph $G = (V, E, A, \mu)$ has a path from $s$ to $t$ of weight $d$, whereas termination with $d = \infty$ signals that $\mathrm{dist}_G(s, t) = \infty$.

```
1  g := [s ↦ 0, v ↦ ∞ : v ≠ s]
2  C := {s}
3  while C ≠ ∅ do
4      v := arg min_{v ∈ C} S(g, v)
5      if v = t then return g(t)
6      for (v, a, w) ∈ E do
7          if g(v) + μ(v, a, w) < g(w) then
8              g(w) := g(v) + μ(v, a, w)
9              C := C ∪ {w}
10     C := C \ {v}
11 return ∞
Algorithm 1: Directed search algorithm.
```
Algorithm 1 maintains a set of frontier nodes $C$ and a mapping $g \colon V \to \mathbb{Q}_{\geq 0} \cup \{\infty\}$ such that $g(w)$ is the weight of the best known path from $s$ to $w$. In Line 4, a selection strategy $S$ determines which node $v$ to expand next. Starting from Line 6, a successor $w$ of $v$ is added to the frontier if its distance improves.

Let $h \colon V \to \mathbb{Q}_{\geq 0} \cup \{\infty\}$ estimate the distance from all nodes to a target $t \in V$. The selection strategies sending $(g, v)$ respectively to $g(v)$, $g(v) + h(v)$ or $h(v)$ yield the classical Dijkstra's, A<sup>∗</sup> and greedy best-first search (GBFS) algorithms.
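The generic scheme and its three selection strategies can be sketched in Python as follows. This is a direct, unoptimized transcription of Algorithm 1 for illustration only: the frontier is a plain set scanned with `min`, the graph is given through a hypothetical `successors` callback, and the small graph `EDGES` with heuristic `H` is made up for the example.

```python
import math
from typing import Callable, Dict, Hashable, Iterable, Tuple

Node = Hashable
Successors = Callable[[Node], Iterable[Tuple[Node, float]]]
Strategy = Callable[[Dict[Node, float], Node], float]


def directed_search(s: Node, t: Node, successors: Successors, select: Strategy) -> float:
    """Algorithm 1: returns the weight of the path found, or infinity."""
    g: Dict[Node, float] = {s: 0}          # missing keys stand for infinity
    frontier = {s}                          # the set C
    while frontier:
        v = min(frontier, key=lambda u: select(g, u))
        if v == t:
            return g[t]
        for w, weight in successors(v):
            if g[v] + weight < g.get(w, math.inf):
                g[w] = g[v] + weight
                frontier.add(w)
        frontier.discard(v)
    return math.inf


def dijkstra(g, v):                         # S(g, v) = g(v)
    return g[v]


def a_star(h):                              # S(g, v) = g(v) + h(v)
    return lambda g, v: g[v] + h(v)


def gbfs(h):                                # S(g, v) = h(v)
    return lambda g, v: h(v)


# Hypothetical example graph (node -> list of (successor, weight)) and heuristic.
EDGES = {"s": [("a", 1), ("b", 4)], "a": [("b", 1), ("t", 5)], "b": [("t", 1)], "t": []}
H = {"s": 3, "a": 2, "b": 1, "t": 0}
```

On this graph, Dijkstra's and A<sup>∗</sup> strategies return the distance 3, whereas GBFS follows the lowest heuristic values and returns the weight 5 of the path s, b, t, which illustrates that GBFS may return a non-shortest path.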

When instantiating $S$ with Dijkstra's selection strategy, a return value $d \neq \infty$ is guaranteed to equal $\mathrm{dist}_G(s, t)$. This is not true for A<sup>∗</sup> and GBFS. However, if $h$ fulfills the following consistency properties, then A<sup>∗</sup> also has this guarantee: $h(t) = 0$ and $h(v) \leq \mu(v, a, w) + h(w)$ for every $(v, a, w) \in E$ (see, e.g., [52]).
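For a finite, explicitly listed edge set, the two consistency conditions can be checked mechanically. The sketch below is only meant to illustrate the definition (reachability graphs of Petri nets are typically infinite, so such a check is not available there); the edge list and both heuristics are made-up examples.

```python
from typing import Callable, Hashable, Iterable, Tuple

# (v, w, weight of the edge from v to w); action labels are dropped for brevity.
Edge = Tuple[Hashable, Hashable, float]


def is_consistent(edges: Iterable[Edge], h: Callable[[Hashable], float], target) -> bool:
    """Check h(target) = 0 and h(v) <= weight + h(w) for every listed edge."""
    if h(target) != 0:
        return False
    return all(h(v) <= weight + h(w) for v, w, weight in edges)


EDGES = [("s", "a", 1), ("s", "b", 4), ("a", "b", 1), ("a", "t", 5), ("b", "t", 1)]
GOOD = {"s": 3, "a": 2, "b": 1, "t": 0}   # consistent
# Never over-estimates the true distance, yet violates h(s) <= 1 + h(a).
BAD = {"s": 3, "a": 0, "b": 1, "t": 0}
```

The second heuristic shows that never over-estimating the distance is not enough for consistency.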

In the setting of infinite graphs, unlike GBFS, A<sup>∗</sup> and Dijkstra's selection strategies guarantee termination if $\mathrm{dist}_G(s, t) \neq \infty$. Yet, we introduce unbounded heuristics for which termination is also guaranteed for GBFS. Note that these guarantees would vanish in the presence of zero weights. An infinite path $\pi$ is a sequence of nodes $(v_i)_{i \in \mathbb{N}}$ and actions $(a_i)_{i \in \mathbb{N}}$ such that $(v_i, a_i, v_{i+1}) \in E$ for all $i \in \mathbb{N}$. We say that heuristic $h$ is unbounded (w.r.t. $G$) if for every infinite simple path $v_0, v_1, v_2, \ldots$ of $G$ and for every $b \in \mathbb{Q}_{\geq 0}$, there exists an index $i$ s.t. $h(v_i) \geq b$. In other words, unboundedness forbids an infinite simple path of $G$ to "cap" at some distance estimate $b$. The following technical lemma enables us to prove termination of GBFS in the presence of unbounded heuristics.

**Lemma 1.** If G is locally finite, then the following holds:


3. No node is expanded infinitely often by Algorithm 1.

**Theorem 1.** Algorithm 1 with the greedy best-first search selection strategy always finds reachable targets for locally finite graphs and unbounded heuristics.

Proof. First observe that Algorithm 1 satisfies this invariant:

if $g(v) \neq \infty$, then $g(v)$ is the weight of a path from $s$ to $v$ in $G$ whose nodes were all expanded, except possibly $v$. (∗)

Assume $\mathrm{dist}_G(s, t) \neq \infty$. For the sake of contradiction, suppose $t$ is never expanded. Let $K_i$ be the subgraph of $G$ induced by nodes expanded at least once within the first $i$ iterations of the **while** loop. In particular, $K_1$ is the graph made only of node $s$. Let $K = K_1 \cup K_2 \cup \cdots$. By Lemma 1 (3), no node is expanded infinitely often, hence $K$ is infinite. Moreover, $K$ has finite out-degree, and each node of $K$ is reachable from $s$ in $K$ by (∗). Thus, by König's lemma, $K$ contains an infinite path $v_0, v_1, \ldots \in V$ of pairwise distinct nodes.

Let $w$ be a node of $K$ minimizing $\mathrm{dist}_G(w, t)$. That minimum is well-defined by Lemma 1 (2). Since $s \in K_1 \subseteq K$ and $t$ is reachable from $s$, we have $\mathrm{dist}_G(w, t) \leq \mathrm{dist}_G(s, t) < \infty$. By minimality of $w \neq t$, there exists an edge $(w, a, w')$ of $G$ such that $\mathrm{dist}_G(w', t) < \mathrm{dist}_G(w, t)$ and $w'$ does not appear in $K$. Note that $w'$ is added to $C$ at some point, but is never expanded as it would otherwise belong to $K$. Let $i$ be the smallest index such that $w$ belongs to $K_i$. Since $h$ is unbounded, there exists $j$ such that $h(v_j) > h(w')$ and $v_j$ is expanded after iteration $i$ of the while loop. This is a contradiction as $w'$ would have been expanded instead of $v_j$.

#### **4 Directed Reachability**

In this section, we explain how to instantiate Algorithm 1 for finding short(est) firing sequences witnessing reachability in weighted Petri nets. Since Dijkstra's selection strategy does not require any heuristic, we focus on A<sup>∗</sup> and greedy best-first search, which require consistent and unbounded heuristics. More precisely, we introduce distance under-approximations (Section 4.1); present relevant concrete distance under-approximations (Section 4.2); and put everything together into our framework (Section 4.3).

#### **4.1 Distance Under-approximations**

A distance under-approximation of a weighted Petri net $\mathcal{N} = (P, T, f, \lambda)$ is a function $d \colon \mathbb{N}^P \times \mathbb{N}^P \to \mathbb{Q}_{\geq 0} \cup \{\infty\}$ such that for all $m, m', m'' \in \mathbb{N}^P$:

**–** $d(m, m') \leq \mathrm{dist}_{\mathcal{N}}(m, m')$,

**–** $d(m, m'') \leq d(m, m') + d(m', m'')$ (triangle inequality), and

**–** $d$ is effective, i.e. there is an algorithm that evaluates $d$ on all inputs.

We naturally obtain a heuristic from $d$ for a directed search towards marking $m_{\text{target}}$. Indeed, let $h \colon \mathbb{N}^P \to \mathbb{Q}_{\geq 0} \cup \{\infty\}$ be defined by $h(m) := d(m, m_{\text{target}})$. The following proposition shows that $h$ is a suitable heuristic for A<sup>∗</sup>:

**Proposition 1.** Mapping h is a consistent heuristic.

Proof. Let $m, m' \in \mathbb{N}^P$ and $t \in T$ be such that $m \xrightarrow{t}_{\mathcal{N}} m'$. We have:

$$h(m) = d(m, m_{\text{target}}) \leq d(m, m') + d(m', m_{\text{target}}) \leq \mathrm{dist}_{\mathcal{N}}(m, m') + h(m') \leq \lambda(t) + h(m').$$


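Moreover, $h(m_{\text{target}}) = d(m_{\text{target}}, m_{\text{target}}) \leq \mathrm{dist}_{\mathcal{N}}(m_{\text{target}}, m_{\text{target}}) = 0$, where the last equality follows from the fact that weights are positive. Since $h$ only takes non-negative values, $h(m_{\text{target}}) = 0$.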

#### **4.2 From Petri Net Relaxations to Distance Under-approximations**

We now introduce classical relaxations of Petri nets which over-approximate reachability and consequently give rise to distance under-approximations. The main source of hardness of the reachability problem stems from the fact that places are required to hold a non-negative number of tokens. If we relax this requirement and allow negative numbers of tokens, we obtain a more tractable relation. More precisely, we write $m \xrightarrow{t}_{\mathbb{Z}} m'$ iff $m' = m + \Delta_t$. Note that transitions are always firable under this semantics. Moreover, they may lead to "markings" with negative components.

Another source of hardness comes from the fact that markings are discrete. Hence, we can further relax $\rightarrow_{\mathbb{Z}}$ into $\rightarrow_{\mathbb{Q}}$ where transitions may be scaled down:

$$m \xrightarrow[]{t}\_{\mathbb{Q}} m' \iff m' = m + \delta \cdot \Delta\_t \text{ for some } 0 < \delta \le 1.$$

One gets a less crude relaxation from considering nonnegative "markings" only:

$$m \xrightarrow{t}\_{\mathbb{Q}\_{\ge 0}} m' \iff (m \ge \delta \cdot g\_t) \text{ and } (m' = m + \delta \cdot \Delta\_t) \text{ for some } 0 < \delta \le 1.$$

Under these, we obtain "markings" from $\mathbb{Q}^P$ and $\mathbb{Q}_{\geq 0}^P$ respectively. Petri nets equipped with relation $\rightarrow_{\mathbb{Q}_{\geq 0}}$ are known as continuous Petri nets [14,15].
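The three relaxed step relations can be sketched with exact rational arithmetic. The snippet assumes the Figure 1 net as encoded from the flow function in Section 2; the function names are illustrative, and returning `None` for a disabled continuous step is a design choice of this sketch.

```python
from fractions import Fraction
from typing import Optional, Tuple

# Guards and effects of the Figure 1 net (assumed encoding, places ordered p1, p2).
GUARD = {"t1": (0, 0), "t2": (1, 0), "t3": (1, 0)}
EFFECT = {"t1": (1, 0), "t2": (0, 1), "t3": (-1, 0)}


def step_z(m: Tuple, t: str) -> Tuple:
    """Integer relaxation: m' = m + Delta_t, guards are ignored."""
    return tuple(a + d for a, d in zip(m, EFFECT[t]))


def step_q(m: Tuple, t: str, delta: Fraction) -> Tuple:
    """Rational relaxation: m' = m + delta * Delta_t for 0 < delta <= 1."""
    if not 0 < delta <= 1:
        raise ValueError("scaling factor must satisfy 0 < delta <= 1")
    return tuple(a + delta * d for a, d in zip(m, EFFECT[t]))


def step_q_nonneg(m: Tuple, t: str, delta: Fraction) -> Optional[Tuple]:
    """Continuous semantics: additionally requires m >= delta * g_t."""
    if not 0 < delta <= 1:
        raise ValueError("scaling factor must satisfy 0 < delta <= 1")
    if any(a < delta * need for a, need in zip(m, GUARD[t])):
        return None  # the scaled guard is violated
    return step_q(m, t, delta)
```

For instance, $t_3$ can be fired from (0, 0) over $\mathbb{Z}$ and yields (-1, 0), while half of $t_2$ is not firable from (0, 0) under the continuous semantics since $p_1$ holds no token.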

To unify all three relaxations, we sometimes write $m \xrightarrow{\delta t}_{\mathbb{G}} m'$ to emphasize the scaling factor $\delta$, where $\delta = 1$ whenever $\mathbb{G} = \mathbb{Z}$. Let $d_{\mathbb{G}} \colon \mathbb{N}^P \times \mathbb{N}^P \to \mathbb{Q}_{\geq 0} \cup \{\infty\}$ be defined as $d_{\mathbb{G}}(m, m') := \infty$ if $m'$ is not reachable from $m$ under $\xrightarrow{*}_{\mathbb{G}}$, and otherwise:

$$d\_{\mathbb{G}}(m, m') := \min \left\{ \sum\_{i=1}^{n} \delta\_i \cdot \lambda(t\_i) : m \xrightarrow{\delta\_1 t\_1 \cdots \delta\_n t\_n} \_{\mathbb{G}} m' \right\}.$$

In words, $d_{\mathbb{G}}(m, m')$ is the weight of a shortest path from $m$ to $m'$ in the graph induced by the relaxed step relation $\rightarrow_{\mathbb{G}}$, where weights are scaled accordingly.

We now show that any $d_{\mathbb{G}}$, which we call the $\mathbb{G}$-distance, is a distance under-approximation, and first show effectiveness of all $d_{\mathbb{G}}$. It is well-known and readily seen that reachability over $\mathbb{G} \in \{\mathbb{Z}, \mathbb{Q}\}$ is characterized by the following state equation, since transitions are always firable due to the absence of guards:

$$m \xrightarrow{\ast}\_{\mathbb{G}} m' \iff \exists \sigma \in \mathbb{G}\_{\geq 0}^{T} : m' = m + \sum\_{t \in T} \sigma(t) \cdot \Delta\_{t}.$$

Here, the vector $\sigma$ can be seen as the Parikh image of a sequence leading from $m$ to $m'$.

**Proposition 2.** The functions $d_{\mathbb{Z}}$, $d_{\mathbb{Q}}$ and $d_{\mathbb{Q}_{\geq 0}}$ are effective.

Proof. By the state equation, we have:

$$d\_{\mathbb{G}}(m, m') = \min \left\{ \sum\_{t \in T} \lambda(t) \cdot \sigma(t) : \sigma \in \mathbb{G}\_{\geq 0}^{T}, m' = m + \sum\_{t \in T} \sigma(t) \cdot \Delta\_{t} \right\}.$$

Therefore, $d_{\mathbb{Q}}(m, m')$ (resp. $d_{\mathbb{Z}}(m, m')$) is computable by linear programming (resp. integer linear programming), which is complete for P (resp. NP) in its variant where one must check whether the minimal solution is at most some bound.

For $d_{\mathbb{Q}_{\geq 0}}$, note that the reachability relation of a continuous Petri net can be expressed in the existential fragment of linear real arithmetic [8]. Hence, effectiveness follows from the decidability of linear real arithmetic.
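To make the optimization problem from the proof tangible, here is a deliberately naive sketch that evaluates the $\mathbb{Z}$-distance on the Figure 1 net by enumerating candidate solutions of the state equation. It is not how one would compute the distance in practice (an ILP or LP solver would be used instead); the parameter `bound` is an assumption of this sketch, and the result is exact only when an optimal solution has all entries at most `bound`.

```python
import math
from itertools import product
from typing import Tuple

# Effects and weights of the Figure 1 net (assumed encoding, unweighted net).
EFFECT = {"t1": (1, 0), "t2": (0, 1), "t3": (-1, 0)}
WEIGHT = {"t1": 1, "t2": 1, "t3": 1}


def z_distance(m: Tuple[int, ...], target: Tuple[int, ...], bound: int = 5) -> float:
    """Minimize sum_t lambda(t) * x(t) subject to target = m + sum_t x(t) * Delta_t,
    over vectors x with integer entries in 0..bound (brute force)."""
    transitions = sorted(EFFECT)
    best = math.inf
    for x in product(range(bound + 1), repeat=len(transitions)):
        reached = tuple(
            m[p] + sum(k * EFFECT[t][p] for k, t in zip(x, transitions))
            for p in range(len(m))
        )
        if reached == tuple(target):
            best = min(best, sum(k * WEIGHT[t] for k, t in zip(x, transitions)))
    return best
```

A return value of infinity only certifies that no solution exists within the bound. For the target (0, 1) and source (0, 2) it is nevertheless conclusive, since no transition of this net removes tokens from $p_2$.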

Altogether, we conclude that $d_{\mathbb{G}}$ is a distance under-approximation. Furthermore, we can show that $d_{\mathbb{G}}$ yields unbounded heuristics, which, by Theorem 1, ensure termination of GBFS on reachable instances:

**Theorem 2.** Let $\mathbb{G} \in \{\mathbb{Z}, \mathbb{Q}, \mathbb{Q}_{\geq 0}\}$. Then $d_{\mathbb{G}}$ is a distance under-approximation. Moreover, the heuristics arising from it are unbounded.

Proof. Let $\mathcal{N} = (P, T, f, \lambda)$ be a weighted Petri net. Effectiveness of $d_{\mathbb{G}}$ follows from Proposition 2. By definitions and a simple induction, ${\xrightarrow{\sigma}_{\mathcal{N}}} \subseteq {\xrightarrow{\sigma}_{\mathbb{G}}}$ for any sequence $\sigma \in T^*$, with weights left unchanged for unscaled transitions. This implies that $d_{\mathbb{G}}(m, m') \leq \mathrm{dist}_{\mathcal{N}}(m, m')$ for every $m, m' \in \mathbb{G}^P$. Moreover, the triangle inequality holds since for every $m, m', m'' \in \mathbb{G}^P$ and sequences $\sigma, \sigma'$:

$$m \xrightarrow{\sigma}\_{\mathbb{G}} m' \xrightarrow{\sigma'}\_{\mathbb{G}} m'' \text{ implies } m \xrightarrow{\sigma \sigma'}\_{\mathbb{G}} m''.$$

Let us sketch the proof of the second part. Let $m_{\text{target}}$ be a marking and let $h_{\mathbb{G}}$ be the heuristic obtained from $d_{\mathbb{G}}$ for $m_{\text{target}}$. Since $h_{\mathbb{Q}}(m) \leq h_{\mathbb{G}}(m)$ for all $m$ and $\mathbb{G} \in \{\mathbb{Z}, \mathbb{Q}_{\geq 0}\}$, it suffices to prove that $h_{\mathbb{Q}}$ is unbounded. Suppose it is not. There exist $b \in \mathbb{Q}_{\geq 0}$ and pairwise distinct markings $m_0, m_1, \ldots$ each with $h_{\mathbb{Q}}(m_i) \leq b$. Let $x_i$ be a solution to the state equation that gives $h_{\mathbb{Q}}(m_i)$. By well-quasi-ordering and pairwise distinctness, there is a subsequence such that $m_{i_0}(p) < m_{i_1}(p) < \cdots$ for some $p \in P$. Thus, $\lim_{j \to \infty} \left(m_{\text{target}}(p) - m_{i_j}(p)\right) = -\infty$, and hence $\lim_{j \to \infty} x_{i_j}(s) = \infty$ for some $s \in T$ with $\Delta_s(p) < 0$. This means that $b \geq h_{\mathbb{Q}}(m_{i_j}) = \sum_{t \in T} \lambda(t) \cdot x_{i_j}(t) > b$ for a sufficiently large $j$, which is a contradiction.

#### **4.3 Directed Reachability Based on Distance Under-approximations**

We have all the ingredients to use Algorithm 1 for answering reachability queries.

A distance under-approximation scheme is a mapping $\mathcal{D}$ that associates a distance under-approximation $\mathcal{D}(\mathcal{N})$ to each weighted Petri net $\mathcal{N}$. Let $h_{\mathcal{D}(\mathcal{N}), m_{\text{target}}}$ be the heuristic obtained from $\mathcal{D}(\mathcal{N})$ for marking $m_{\text{target}}$. By instantiating Algorithm 1 with this heuristic, we can search for a short(est) firing sequence witnessing that $m_{\text{target}}$ is reachable. Of course, constructing the reachability graph of $\mathcal{N}$ would be at least as difficult as answering this query, or impossible if it is infinite. Hence, we provide $G_{\mathbb{N}}(\mathcal{N})$ symbolically through $\mathcal{N}$ and let Algorithm 1 explore it on-the-fly by progressively firing its transitions.

For each $\mathbb{G} \in \{\mathbb{Z}, \mathbb{Q}, \mathbb{Q}_{\geq 0}\}$, the function $\mathcal{D}_{\mathbb{G}}$ mapping a weighted Petri net $\mathcal{N}$ to its $\mathbb{G}$-distance $d_{\mathbb{G}}$ is a distance under-approximation scheme with consistent and unbounded heuristics by Proposition 1, Theorem 1 and Theorem 2. Although Algorithm 1 is geared towards finding paths, it can prove non-reachability even for infinite reachability graphs. Indeed, at some point, every candidate marking $m \in C$ may be such that $h_{\mathcal{D}(\mathcal{N}), m_{\text{target}}}(m) = \infty$, in which case the search halts with $\infty$. There is no guarantee that this happens, but, as reported e.g. by [23,8], the $\mathbb{G}$-distance for domains $\mathbb{G} \in \{\mathbb{Z}, \mathbb{Q}, \mathbb{Q}_{\geq 0}\}$ does well for witnessing non-reachability in practice, often from the very first marking $m_{\text{init}}$.

An example. We illustrate our approach with a toy example and $\mathcal{D}_{\mathbb{Q}}$ (the scheme based on the state equation over $\mathbb{Q}_{\geq 0}^T$). Consider the Petri net $\mathcal{N}$ illustrated on the left of Figure 1, but marked with $m_{\text{init}} := [p_1 \colon 0, p_2 \colon 0]$. Suppose we wish to determine whether $m_{\text{init}}$ can reach marking $m_{\text{target}} := [p_1 \colon 0, p_2 \colon 1]$ in $\mathcal{N}$.

We consider the case where Algorithm 1 follows a greedy best-first search, but the markings would be expanded in the same way with A<sup>∗</sup>. Let us abbreviate a marking $[p_1 \colon x, p_2 \colon y]$ as $(x, y)$. Since $\Delta_{t_2} = (0, 1)$, the heuristic considers that $m_{\text{init}}$ can reach $m_{\text{target}}$ in a single step using transition $t_2$ (it is unaware of the guard). Marking $(1, 0)$ is expanded and its heuristic value increases to 2 as the state equation considers that both $t_2$ and $t_3$ must be fired (in some unknown order). Markings $(2, 0)$ and $(1, 1)$ are both discovered with respective heuristic values 3 and 1. The latter is more promising, so it is expanded and target $(0, 1)$ is discovered. Since its heuristic value is 0, it is immediately expanded and the correct distance $\mathrm{dist}_{\mathcal{N}}(m_{\text{init}}, m_{\text{target}}) = 3$ is returned. Note that, in this example, the only markings expanded are precisely those occurring on the shortest path.

Handling multiple targets. Algorithm 1 can be adapted to search for some marking from a given target set $X \subseteq \mathbb{N}^P$. The idea consists simply in using a heuristic $h_X \colon \mathbb{N}^P \to \mathbb{Q}_{\geq 0} \cup \{\infty\}$ estimating the weight of a shortest path to any target:

$$h\_X(m) := \min\{h\_{\mathcal{D}(\mathcal{N}), m\_{\text{target}}}(m) : m\_{\text{target}} \in X\}.$$

This is convenient for partial reachability instances occurring in practice, i.e.

$$X := \left\{ \mathbf{m}\_{\text{target}} \in \mathbb{N}^P \colon \mathbf{m}\_{\text{target}}(p) \sim\_p \mathbf{c}(p) \right\} \text{ where } \mathbf{c} \in \mathbb{N}^P \text{ and each } {\sim\_p} \in \{=, \ge\}.$$
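The toy example and the multi-target heuristic $h_X$ can be replayed with the following self-contained sketch. Several simplifications are assumptions of this sketch rather than features of the approach: the net is the Figure 1 net as encoded from the flow function, the target set is finite, and the state-equation heuristic is evaluated by bounded integer enumeration instead of a linear program (on this particular net the minimum is $|a| + b$ for a difference vector $(a, b)$ with $b \geq 0$, so integer and rational optima coincide).

```python
import math
from itertools import product

GUARD = {"t1": (0, 0), "t2": (1, 0), "t3": (1, 0)}
EFFECT = {"t1": (1, 0), "t2": (0, 1), "t3": (-1, 0)}
WEIGHT = {"t1": 1, "t2": 1, "t3": 1}
TRANSITIONS = sorted(EFFECT)


def successors(m):
    """Markings reachable from m by firing one enabled transition."""
    for t in TRANSITIONS:
        if all(a >= need for a, need in zip(m, GUARD[t])):
            yield tuple(a + d for a, d in zip(m, EFFECT[t])), WEIGHT[t]


def h_single(m, target, bound=5):
    """Cheapest state-equation solution with entries in 0..bound (stand-in for an LP)."""
    best = math.inf
    for x in product(range(bound + 1), repeat=len(TRANSITIONS)):
        reached = tuple(
            m[p] + sum(k * EFFECT[t][p] for k, t in zip(x, TRANSITIONS))
            for p in range(len(m))
        )
        if reached == target:
            best = min(best, sum(k * WEIGHT[t] for k, t in zip(x, TRANSITIONS)))
    return best


def h_set(m, targets):
    """h_X(m): minimum of the single-target heuristics over a finite target set X."""
    return min(h_single(m, target) for target in targets)


def search(m_init, targets, strategy):
    """Algorithm 1 on the reachability graph, stopping at any marking in targets.
    Returns the weight found and the list of expanded markings, in order."""
    g = {m_init: 0}
    frontier = {m_init}
    expanded = []
    if strategy == "gbfs":
        key = lambda m: h_set(m, targets)
    else:  # "astar"
        key = lambda m: g[m] + h_set(m, targets)
    while frontier:
        v = min(frontier, key=key)
        expanded.append(v)
        if v in targets:
            return g[v], expanded
        for w, weight in successors(v):
            if g[v] + weight < g.get(w, math.inf):
                g[w] = g[v] + weight
                frontier.add(w)
        frontier.discard(v)
    return math.inf, expanded
```

Calling `search((0, 0), {(0, 1)}, "gbfs")` expands (0, 0), (1, 0), (1, 1) and (0, 1) in that order and returns weight 3, matching the walkthrough above. With the target set `{(0, 1), (2, 0)}`, the search stops at (2, 0) with weight 2.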

#### **5 Experimental Results**

We implemented Algorithm 1 in a prototype called FastForward [10], which supports all presented selection strategies and distance under-approximations. We evaluate FastForward empirically with three main goals in mind. First, we show that our approach is competitive with established tools and can even vastly outperform them, and we also give insights on its performance w.r.t. its parameterizations. Second, we compare the length of the witnesses reported by the different tools. Third, we briefly discuss the quality of the heuristics.

**Technical details.** Our tool is written in C# and uses Gurobi [32], a state-of-the-art MILP solver, for distance under-approximations. Benchmarks were run on a machine with an 8-Core Intel® Core™ i7-7700 CPU @ 3.60GHz running Ubuntu 18.04 and with memory constrained to ∼8GB. We used a timeout of 60 seconds per instance, and all tools were invoked from a Python script using the time module for time measurements.

A minor challenge arises from the fact that many instances specify an upward-closed set of initial markings rather than a single one. For example, an instance may require $m_{\text{init}}(p) \geq 1$ to specify, e.g., an arbitrary number of threads. We handle this by setting $m_{\text{init}}(p) = 1$ and adding a transition $t_p$ producing a token into $p$.

As a preprocessing step, we implemented sign analysis [29]. It is a general pruning technique running in polynomial time that has been shown beneficial for reducing the size of the state-space of Petri nets. Initially, places that carry tokens are viewed as marked. For each transition whose input places are marked, the output places also become marked. When a fixpoint is reached, places left unmarked cannot carry tokens in any reachable marking, so they are discarded.
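The fixpoint computation described above can be sketched as follows. The representation (each transition mapped to its sets of input and output places) and the function name are assumptions made for illustration; additionally reporting the transitions that can never fire is a natural by-product in this sketch, not something the description above spells out.

```python
from typing import Dict, Iterable, Set, Tuple


def sign_analysis(
    pre: Dict[str, Set[str]],
    post: Dict[str, Set[str]],
    initially_marked: Iterable[str],
) -> Tuple[Set[str], Set[str]]:
    """Return (places that may carry tokens, transitions that can never fire)."""
    marked = set(initially_marked)
    changed = True
    while changed:                      # iterate until a fixpoint is reached
        changed = False
        for t in pre:
            # All input places marked: the output places become marked too.
            if pre[t] <= marked and not post[t] <= marked:
                marked |= post[t]
                changed = True
    dead = {t for t in pre if not pre[t] <= marked}
    return marked, dead


# Hypothetical net: p3 is never marked, so t4 can never fire.
PRE = {"t1": set(), "t2": {"p1"}, "t3": {"p1"}, "t4": {"p3"}}
POST = {"t1": {"p1"}, "t2": {"p1", "p2"}, "t3": set(), "t4": {"p2"}}
```

Each round of the loop either marks at least one new place or terminates, so the number of rounds is at most the number of places plus one.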

**Benchmarks.** Due to the lack of tools handling reachability for unbounded state spaces, benchmarks arising in the literature are primarily coverability instances<sup>5</sup>, i.e. reachability towards an upward-closed set of target markings. We gathered 61 positive and 115 negative coverability instances originating from five suites [39,28,6,35,18] previously used for benchmarking [23,8,29]. They arise from the analysis of multi-threaded C programs with shared memory; mutual exclusion algorithms; communication protocols; provenance analysis in the context of a medical messaging and a bug-tracking system; and the verification of Erlang concurrent programs. We further extracted the sypet suite made of 30 positive (standard) reachability instances arising from queries encountered in type-directed program synthesis [24]. The overall goal of this work is to enable a vast range of untapped applications requiring reachability over unbounded state-spaces, rather than just coverability. To obtain further (positive) instances of the Petri net reachability problem, we performed random walks on the Petri nets from the aforementioned coverability benchmarks. To this end, we used the largest quarter of distinct Petri nets from each coverability suite, for a total of 33. We performed one random walk each of lengths 20, 25, 30, 35, 40, 50, 60, 75, 90 and 100, and we saved the resulting marking as the target. For nets with an upward-closed initial marking, we randomly chose to start with a number of tokens between 1 and 20% of the length of the walk. It is important to note that even with long random walks, instances can (and in fact tend to) have short witnesses. To remove trivial instances and only keep the most challenging ones, we removed those instances where any considered tool reported a witness of length at most 20, disregarding the transitions used to generate the initial marking. This leaves us with 127 challenging instances on which the shortest witness is either unknown or has length more than 20. Moreover, this yields real-world Petri nets with no bias towards any specific kind of targets.

<sup>5</sup> The Model Checking Contest focuses on reachability for finite state spaces.
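The random-walk generation of targets can be illustrated by a short sketch. It uses the small Figure 1 net as a stand-in for the benchmark nets, picks enabled transitions uniformly at random, and takes the random generator as a parameter; these choices, like the function name, are assumptions of the sketch and not a description of the scripts actually used.

```python
import random
from typing import List, Tuple

# Stand-in net (Figure 1, assumed encoding); the benchmark nets are much larger.
GUARD = {"t1": (0, 0), "t2": (1, 0), "t3": (1, 0)}
EFFECT = {"t1": (1, 0), "t2": (0, 1), "t3": (-1, 0)}


def random_walk(
    m: Tuple[int, ...], length: int, rng: random.Random
) -> Tuple[Tuple[int, ...], List[str]]:
    """Fire up to `length` randomly chosen enabled transitions from m.
    Returns the final marking (the target) and the fired sequence."""
    fired: List[str] = []
    for _ in range(length):
        enabled = [
            t for t in sorted(GUARD)
            if all(a >= need for a, need in zip(m, GUARD[t]))
        ]
        if not enabled:      # deadlock: stop the walk early
            break
        t = rng.choice(enabled)
        m = tuple(a + d for a, d in zip(m, EFFECT[t]))
        fired.append(t)
    return m, fired
```

The fired sequence is a witness of length at most `length`, but as noted above, the resulting target may well admit a much shorter one.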


This table summarizes the characteristics of the various benchmarks:

**Tool comparison.** To evaluate our approach on reachability instances, we compare FastForward to LoLA [53], a tool developed for two decades that wins several categories of the Model Checking Contest every year. LoLA is geared towards model checking of finite state spaces, but it implements semi-decision procedures for the unbounded case. We further compare the three selection strategies of Algorithm 1: A<sup>∗</sup>, GBFS and Dijkstra; the first two with the distance under-approximation scheme $\mathcal{D}_{\mathbb{Q}}$, which provides the best trade-off between estimate quality and efficiency. In fact, the other heuristics perform strictly worse on almost all instances. We also considered comparing with KReach [17], a tool showcased at TACAS'20 that implements an exact non-elementary algorithm. However, it timed out on all instances, even with a larger time limit of 10 minutes.

Figure 2 depicts the number of reachability instances decided by the tools within the time limit. As shown, all approaches outperform LoLA, with GBFS as the clear winner on the random-walk suite and A<sup>∗</sup> slightly better on the sypet suite. Note that Dijkstra's selection strategy sometimes competes due to its locally very cheap computational cost (no heuristic evaluation), but its performance generally decreases as the distance increases.

**Fig. 2.** Cumulative number of reachability instances decided over time. Left: sypet suite (semi-log scale). Right: random-walk suite (log scale).

To show the versatility of our approach, we also benchmarked FastForward on the original coverability instances. Recall that coverability is EXPSPACE-complete and reduces to reachability in linear time [45,51]. While exceeding the PSPACE-completeness of reachability for finite state-spaces [38,21], the complexity of coverability is far tamer than the non-elementary complexity of (unbounded) reachability. We compare FastForward to four tools implementing algorithms, some of which are tailored specifically to the coverability problem: LoLA, Bfc [39], ICover [29] and the backward algorithm (based on [1]) of mist [28]. We did not test Petrinizer [23] since it only handles negative instances, while we focus on positive ones; likewise for QCover [8] since it is superseded by ICover.

**Fig. 3.** Cumulative number of (positive) coverability instances decided over time. Left: Evaluation on the original instances. Right: Evaluation on the pre-pruned instances.

Figure 3 illustrates the number of coverability instances decided within the time limit. The left side corresponds to an evaluation on the original instances where FastForward performs pruning (included in its runtime). On the right-hand side, the pruned instances are the input for all tools, and the time for this pruning is not included for any tool. As a caveat, ICover performs its own preprocessing which includes pruning among techniques specific to coverability. This preprocessing is enabled (and its time is included) even when pruning is already done. Using FastForward(A<sup>∗</sup>, $\mathcal{D}_{\mathbb{Q}}$), we decide more instances than all tools on unpruned Petri nets, and one fewer than Bfc for pre-pruned instances. It is worth mentioning that with a time limit of 10 minutes per instance, FastForward(A<sup>∗</sup>, $\mathcal{D}_{\mathbb{Q}}$) is the only tool to decide all 61 instances.

**Fig. 4.** Runtime comparison against FF(A<sup>∗</sup>, $\mathcal{D}_{\mathbb{Q}}$) (left) and FF(GBFS, $\mathcal{D}_{\mathbb{Q}}$) (right), in seconds, for individual instances without pre-pruning. Tools on the first column of each side include coverability and reachability instances, while those on the second column of each side include coverability only. Marks on the green lines denote timeouts (60 s).

We also compared the running time of A<sup>∗</sup> and GBFS with $\mathcal{D}_{\mathbb{Q}}$ to the other tools and approaches. For each tool, we considered the type of instances it can handle: either reachability and coverability, or coverability only. Figure 4 depicts this comparison, where the base approach is faster for data points that lie in the upper-left half of the graph. The axes start at 0.1 second to avoid a comparison based on technical aspects such as the programming language. Yet, LoLA, Bfc and mist regularly solve instances faster than this, which speaks to their level of optimization. We can see that FastForward outperforms ICover, LoLA and mist overall. We cannot compete with Bfc in execution time as it is a highly optimized tool tailored only to the coverability problem, which can therefore employ optimization techniques such as Karp-Miller trees that do not work for reachability queries.

**Length of the witnesses.** Since our approach is also geared towards the identification of short(est) reachability witnesses, we compared the different tools with respect to the length of the reported witness, as depicted in Figure 5. Positive values on the y-axis mean the witness was not minimal, while y = 0 means it was. Note that the points for Bfc must be taken with a grain of salt: it uses a different file format, and its translation utility can introduce additional transitions. This means that even if Bfc found a shortest witness, it could be longer than a shortest one of the original instance.

**Fig. 5.** Length of the returned witness, per tool, compared to the length of a shortest witness. ICover is left out as it does not return witnesses. FF(A<sup>∗</sup>, $\mathcal{D}_{\mathbb{Q}}$), FF(Dijkstra) and mist are left out as they are guaranteed to return shortest witnesses.

Still, the graph shows that reported witnesses can be far from minimal. For example, on one instance LoLA returns a witness that is 53 transitions longer than the one of FastForward(A<sup>∗</sup>, $\mathcal{D}_{\mathbb{Q}}$). Nevertheless, LoLA returns a shortest witness on 28 out of 43 instances. Similarly, FastForward(GBFS, $\mathcal{D}_{\mathbb{Q}}$) finds a shortest path on 60 out of 83 instances<sup>6</sup>. In contrast, mist finds a shortest witness on all instances since its backward algorithm is guaranteed to do so on unweighted Petri nets, which constitute all of our instances. Again, this approach is tailored to coverability and cannot be lifted to reachability.

**Heuristics and pruning.** We briefly discuss the quality of the heuristics and the impact of pruning. The left-hand side of Figure 6 compares the exact distance to the estimated distance from the initial marking.<sup>7</sup> It shows that the estimate is highly accurate for all G-distances, and even more so for G = Q<sub>≥0</sub>. We experimented with this distance using the logical translation of [8] and Z3 [49] as the optimization modulo theories solver. At present, it appears that the gain in estimate quality does not compensate for the extra computational cost.

As depicted on the right-hand side of Figure 6, pruning can make some instances trivial, but in general, many challenging instances remain so. On average, around 50% of places and 40% of transitions were pruned.

<sup>6</sup> These numbers disregard instances where the tool did not finish or where a shortest witness is not known, i.e. no method guaranteeing one finished in time.

<sup>7</sup> Z3 reported two non-optimal solutions, which explains the two points above the line.

**Fig. 6.** Left: initial distance estimation compared to the exact distance (points closer to the diagonal are better). Right: number of instances per percentage of places (left) and transitions (right) removed by pruning (rounded to nearest multiple of 10).

#### **6 Conclusion**

We presented an efficient approach to the Petri net reachability problem that uses state-space over-approximations as distance oracles in the classical graph traversal algorithms A<sup>∗</sup> and greedy best-first search. Our experiments have shown that using the state equation over Q<sub>≥0</sub><sup>T</sup> provides the best trade-off between computational feasibility and the accuracy of the oracle. However, we expect that further advances in optimization modulo theories solvers may enable employing stronger over-approximations such as continuous Petri nets in the future.

Moreover, non-algebraic distance under-approximations also fit naturally in our framework, e.g. the syntactic distance of [55] and "α-graphs" of [24]. These are crude approximations with low computational cost. Our preliminary tests show that, although they could not compete with our distances, they can provide early speed-ups on instances with large branching factors. An interesting line of research consists in identifying cheap approximations with better estimates.

We wish to emphasize that our approach to the reachability problem has the potential to also be naturally used for semi-deciding reachability in extensions of Petri nets with a recursively enumerable reachability problem, such as Petri nets with resets and transfers [3,19] as well as colored Petri nets [37]. These extensions have, for instance, been used for the generation of program loop invariants [54], the validation of business processes [59] and the verification of multi-threaded C and Java program skeletons with communication primitives [16,39]. Linear rational and integer arithmetic over-approximations for such extended Petri nets exist [12,9,34,31] and could smoothly be used inside our framework.

#### **Acknowledgments**

We thank Juliette Fournis d'Albiat for her help with extracting the sypet suite.

#### **References**



### Bridging Arrays and ADTs in Recursive Proofs

Grigory Fedyukovich<sup>1</sup> (✉) and Gidon Ernst<sup>2</sup>

<sup>1</sup> Florida State University, Tallahassee, USA, grigory@cs.fsu.edu <sup>2</sup> Ludwig-Maximilians-University, Munich, Germany, gidon.ernst@lmu.de

Abstract. We present an approach to synthesize relational invariants to prove equivalences between object-oriented programs. The approach bridges the gap between recursive data types and arrays that serve to represent internal states. Our relational invariants are recursively defined, and thus are valid for data structures of unbounded size. Based on introducing recursion into the proofs by observing and lifting the constraints from joint methods of the two objects, our approach is fully automatic and can be seen as an algorithm for solving Constrained Horn Clauses (CHC) of a specific sort. It has been implemented on top of the SMT-based CHC solver AdtChc and evaluated on a range of benchmarks.

#### 1 Introduction

Relational verification is widely applicable during an iterative process of software development, when a high-level specification, a prototype implementation, or even an arbitrary previous version is compared to the current version and verified for the absence of newly introduced bugs. As software grows large, *compositionality* becomes a crucial factor to achieve scalability of relational verification tasks: reasoning about pairs of entire programs is reduced to reasoning about pairs of modules or isolated components of code. Proofs found for one component can be reused while reasoning about another component, or even the system as a whole. Successful examples in large-scale verification projects include step-wise refinement in seL4 [30] and the integration of model checking into the software development workflow in AWS C Common [11].

In this work, we represent relational verification problems over *object-oriented programs* as Constrained Horn Clauses (CHC). A CHC is an implication in first-order logic that involves a set of unknown predicates. For a system of CHCs, we wish to find an interpretation for all predicates that validates all implications. CHCs are used in various tasks appearing in verification, e.g., finding loop invariants or function summaries. For relational verification, a system of CHCs can be constructed by pairing components of code of two versions in lockstep and supplying it with relational pre- and post-conditions [14, 39, 44, 53]. State-of-the-art tools for solving CHCs, e.g., [9,19,21,27,32], are based on Satisfiability Modulo Theories (SMT), e.g., [40, 47]. They gradually become more robust, as long as the programs under analysis do not have a *mixed use of data structures*.
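As a small illustration of such an encoding (a toy example of our own, not one of the benchmarks), consider the loop `x = 0; while (x < 10) x = x + 1; assert(x == 10)`. With an unknown predicate *inv* standing for the loop invariant, the task can be sketched as three CHCs:

$$\begin{aligned} x = 0 &\implies inv(x) \\ inv(x) \land x < 10 \land x' = x + 1 &\implies inv(x') \\ inv(x) \land \neg(x < 10) &\implies x = 10 \end{aligned}$$

One interpretation that validates all three implications is *inv*(x) ≡ 0 ≤ x ≤ 10: it holds initially, it is preserved by the increment whenever x < 10, and together with x ≥ 10 it forces x = 10.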

Verification conditions of real-world problems involve data structures such as arrays and Algebraic Data Types (ADTs) of unknown size, expecting the proofs to capture (quantified or recursive) properties over countably infinite sets of elements. Arrays are handled in loops and often require finding universally quantified loop invariants [21]. ADTs, such as lists, maps, and sets, require reasoning by structural induction [47] and often rely on additional helper lemmas that are difficult to synthesize automatically. For relational verification tasks, where one program is over arrays and another is over ADTs, the solvers likely need to reason over quantified formulas and induction *at the same time*, which is currently challenging for most automated tools.

© The Author(s) 2021

J. F. Groote and K. G. Larsen (Eds.): TACAS 2021, LNCS 12652, pp. 24–42, 2021.

https://doi.org/10.1007/978-3-030-72013-1\_2

We propose a set of new algorithms for solving CHCs constructed by pairing programs over arrays and ADTs. Because we deal with object-oriented programs, the data structures might be accessed and modified in any given method, and our pairing is done for each method separately. Relational proofs are synthesized over the data structures – they describe a relation that holds while *simultaneously traversing pairs of elements* by any of the methods. Our key idea is that not all methods may be needed for the actual synthesis. In fact, our algorithm generates a candidate proof by bridging a single pair of methods and then validates/repairs it on all others. In essence, we observe how pairs of inputs (or pairs of outputs) change the states, guess a candidate relation between elements of states, and (dis-)prove it on all other methods using an SMT-based theorem prover.

Our synthesis strategy is customized for different classes of benchmarks via so-called *recipes*. We present two recipes for the list ADT that are applicable, respectively, for (1) stacks and queues, and (2) sets, multisets, and maps. They both discover nontrivial invariants that need a *recursive interpretation*. We independently generate its base and recursive cases. The key point in determining the relations is to automatically investigate how an input or an output affects the state. Finally, we discover auxiliary lemmas that provide additional properties about objects in isolation and help prove that the inferred invariants are valid.

Importantly, in contrast to a more lightweight CHC setting over numerical theories (and even arrays) that can rely on an SMT solver to validate its recursion-free solutions, the validation of our *recursive solutions* is conducted by structural induction. We thus rely on recent advances in SMT-based *fully automated* theorem proving [55], which has recently gained support for arrays. The experiments have shown that the approach is reasonably fast in practice. Our contribution, while presented in the CHC context, can be lifted to the program analysis context and implemented in a range of robust verification tools that are designed to support compositionality [7, 24].

The rest of the paper is structured as follows. A short outline on background and notation is given in Sect. 2. In Sect. 3, we give an overview of the approach. Then, Sect. 4 and Sect. 5 present our recipes. Finally, we give the evaluation details in Sect. 6, related work in Sect. 7, and conclude the paper in Sect. 8.

#### 2 Preliminaries

An *object* O = (St, Init, (Op<sub>n</sub>)<sub>n∈[1,N]</sub>) is defined over internal states St, with initialization Init(s) denoting initial states s, and methods Op<sub>n</sub>, also called operations, for some identifier n (which for simplicity is treated as a natural number in some finite interval, but later sections liberally refer to Op<sub>n</sub> by their name). Each operation Op<sub>n</sub>(in, s, s′, out) defines transitions between a pair of states s and s′ for a given input in, producing an output out. Moreover, each operation has an associated precondition pre<sub>n</sub>(in, s), ranging over the input and pre-state.

In this paper, we take a syntactic approach by representing states as tuples of variables. Specifically, we assume that Init(s) and each operation Op(in, s, s′, out) is given as a predicate, i.e., as a characteristic formula, over the specified parameters, that holds for initial states, respectively, when the program can take a particular transition. Such a formula can be obtained from the source code by symbolic execution, and we assume that the effect of loops inside operations is captured by quantified formulas, creation of which is an orthogonal problem. Hence, our approach is language agnostic.

We assume that the programs under consideration are deterministic, and we assume that pre<sub>n</sub>(in, s) ⟹ ∃s′, out. Op<sub>n</sub>(in, s, s′, out). Note that for deterministic programs, the existential quantifier in ∃s′, out. Op<sub>n</sub>(in, s, s′, out) can be eliminated if pre<sub>n</sub>(in, s) holds, as s′ and out are functionally determined by in and s.

We aim at solving a *relational verification problem* over two objects and reduce it to *inductive invariant* inference over a *composition* of two objects.

Definition 1. *Two objects* A *and* C *are* equivalent *if there exists an inductive invariant R over a* composition *of these objects, which satisfies all clauses below. It connects two states* St<sup>A</sup> *and* St<sup>C</sup> *before and after each pair of operations* (Op<sup>A</sup><sub>n</sub>, Op<sup>C</sup><sub>n</sub>)<sub>n∈[1,N]</sub>*.*

$$\left\{\begin{array}{ll}
\text{initialization:} & Init^A(as) \land Init^C(cs) \implies \mathbf{R}(as, cs) \\
\text{consecution:} & \mathbf{R}(as, cs) \land Op\_1^A(in, as, as', out^A) \land Op\_1^C(in, cs, cs', out^C) \implies \mathbf{R}(as', cs') \\
& \ldots \\
& \mathbf{R}(as, cs) \land Op\_N^A(in, as, as', out^A) \land Op\_N^C(in, cs, cs', out^C) \implies \mathbf{R}(as', cs') \\
\text{safety (applicability):} & \mathbf{R}(as, cs) \land pre\_1^A(in, as) \implies pre\_1^C(in, cs) \\
& \mathbf{R}(as, cs) \land pre\_1^C(in, cs) \implies pre\_1^A(in, as) \\
& \ldots \\
& \mathbf{R}(as, cs) \land pre\_N^A(in, as) \implies pre\_N^C(in, cs) \\
& \mathbf{R}(as, cs) \land pre\_N^C(in, cs) \implies pre\_N^A(in, as) \\
\text{safety (outputs):} & \mathbf{R}(as, cs) \land Op\_1^A(in, as, as', out^A) \land Op\_1^C(in, cs, cs', out^C) \implies out^A = out^C \\
& \ldots \\
& \mathbf{R}(as, cs) \land Op\_N^A(in, as, as', out^A) \land Op\_N^C(in, cs, cs', out^C) \implies out^A = out^C
\end{array}\right.$$

Implications in Def. 1 define a set of Constrained Horn Clauses (CHC) over an uninterpreted relation symbol *R*. There are three types of constraints: (1) initialization, (2) consecution, and (3) safety. The third, safety, reflects the actual relational specification, i.e., the correspondence between the programs under analysis, in terms of the user-visible variables, namely the input in and the respective outputs out<sup>A</sup> and out<sup>C</sup>. Here, safety is divided into applicability (coincidence of preconditions) and equivalence of outputs, which together ensure that the two programs are observationally equivalent. To prove that this equivalence holds, one needs to infer a more complicated invariant *R* over the internal state. For this reason, we need the initialization and the consecution constraints: whatever happens due to each operation, the invariant is maintained, and by safety, the programs remain observationally equivalent indefinitely.

Problem Statement: We seek an interpretation of *R* that satisfies all constraints in Def. 1 simultaneously. This conventional formulation of a CHC task lets us use any off-the-shelf CHC solver. However, the problem is undecidable in general, thus no solver is guaranteed to handle our specific tasks. Furthermore, existing solvers mainly support lightweight arithmetic theories, and a few exceptions also support ADTs [27] and arrays [21,32]. To the best of our knowledge, there is no CHC solver that supports ADTs and arrays *at the same time*, and there is no CHC solver that synthesizes recursive solutions.

Context: The system of CHCs ensures that A and C can be substituted interchangeably in any calling context, and it is applicable to a wide range of techniques for formal program development. The focus on equivalence instead of subsumption is not essential for our work, and the presented approach works for the asymmetric case just the same. Specifically, Liskov and Wing's substitution principle [36] follows (precondition strengthening is reflected by the applicability constraints from pre<sup>A</sup> to pre<sup>C</sup> , and all postconditions with respect to the outputs are equivalent). Data Refinement [15, 25] follows similarly (Def. 1 characterizes that *R* is a forward simulation [37]). See Sect. 7 for more details.

#### 3 Synthesis of Recursive Relational Invariants

In this section, we present the fundamentals of the approach to synthesize recursive relational invariants for systems over arrays and ADTs that we instantiate and illustrate on examples in the subsequent sections.

#### 3.1 Overview

Our approach is purely symbolic and fully automatic in both stages: generating a candidate relational invariant, and proving it correct (i.e., validating). The key insight is an analysis of the operations that are joined in the constraints of Def. 1. We follow a strategy of introducing recursion into the interpretation based on ADTs, aligning the base case with initialization and the recurrence conditions with joint operations.

#### Algorithm 1: Automated synthesis of recursive relational invariants

**Input:** Objects A = (as, Init<sup>A</sup>, (Op<sup>A</sup><sub>n</sub>)<sub>n∈N</sub>) and C = (cs, Init<sup>C</sup>, (Op<sup>C</sup><sub>n</sub>)<sub>n∈N</sub>), where as, cs are the state variables, and xs is a list variable of as

**Output:** relational invariant *R* between A and C

1. *R*(nil, cs) ← Init<sup>A</sup>(as[xs := nil]) ∧ Init<sup>C</sup>(cs);
2. φ<sub>r</sub> ← true;
3. **let** y and ys be fresh variables;
4. **while** true **do**
5. &emsp;cs<sub>r</sub> ← Update(Op<sup>A</sup><sub>n</sub>, Op<sup>C</sup><sub>n</sub>, as[xs := cons(y, ys)], cs) for some n ∈ N;
6. &emsp;φ<sub>r</sub> ← φ<sub>r</sub> ∧ Match(Op<sup>A</sup><sub>m</sub>, Op<sup>C</sup><sub>m</sub>, as[xs := cons(y, ys)], cs, cs<sub>r</sub>) for some m ∈ N;
7. &emsp;*R*(as[xs := cons(y, ys)], cs) ← φ<sub>r</sub> ∧ *R*(as[xs := ys], cs<sub>r</sub>);
8. &emsp;**if** Validate(*R*, A, C) **then return** *R*;

In particular, a relational invariant *R* that bridges an algebraic list xs and an array cs (with auxiliary variables, such as index) is defined recursively over the structure of xs, which produces this general schema:

$$\mathbf{R}(xs, cs) = \begin{cases} \phi\_b(cs) & \text{if } xs = \mathtt{nil} \\ \exists \ cs\_r. \ \phi\_r(y, ys, cs, cs\_r) \land \mathbf{R}(ys, cs\_r) & \text{if } xs = \mathtt{cons}(y, ys) \end{cases} \tag{1}$$

This schema has two placeholders for constraints, φ<sub>b</sub> in the base case and φ<sub>r</sub> in the recursive case, that may refer to the variables in scope (as indicated by their respective parameter lists). Moreover, we seek a Skolem function to eliminate the existentially-quantified state variable cs<sub>r</sub> in the recursive position. Intuitively, the desired Skolem function captures the delta between two array states that corresponds to the delta between xs and ys.

Alg. 1 gives our top-level synthesis procedure for interpretations of *R*. It takes as input two objects, A and C, where as and cs are tuples of variables that represent their respective states. We refer to the primed versions of these state variables as as′ and cs′, assuming that all of as, cs, as′, and cs′ are distinct. The algorithm works with algebraic lists specifically and thus as is assumed to have such a component given by the state variable xs. We denote by as[xs := e] the updated vector of variables such that xs is replaced in as by symbolic expression e.

The base case of the interpretation of *R* is straightforward (line 1): the algorithm uses a predicate Init <sup>C</sup> and a predicate Init<sup>A</sup> in which the xs variable is instantiated to nil. The inductive case of the interpretation of *R* is trickier (line 7). Because several different operations that *produce* state, *consume* state, or *do nothing* with a state are possible (see Def. 2 later in the section), some of them might contribute to different parts of the interpretation being synthesized. In particular, methods Match and Update are responsible for generating a body of *R*. They are instantiated differently for our two recipes in Sect. 4 (applicable for stacks and queues) and Sect. 5 (applicable for (multi)sets and maps).

The first method, Update, synthesizes an updated symbolic state cs<sub>r</sub>, a tuple of symbolic expressions, to be used in the nested inductive call of *R*. It can therefore be understood to compute a witness (or Skolem function) for the existential quantifier in Eq. (1) as an expression of the remaining variables in scope, y, ys, as, cs. The second method, Match, then collects constraints φ<sub>r</sub> from suitable transitions w.r.t. this cs<sub>r</sub>.

In a loop for each candidate interpretation of *R*, our algorithm runs an automated SMT-based theorem prover [55] to validate it (line 8). The algorithm can iterate several times and converges after a successful theorem-prover run.

A noteworthy feature of our framework is that Update and Match should not necessarily be synchronized in pairs. Although cs<sub>r</sub> and the result of Match are going to be eventually combined and used in a single formula, the nondeterministic nature of our synthesis procedure suggests that the two ingredients may originate from potentially non-joint operations, thereby enlarging the search space of possible relational invariants.
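The guess-and-check structure of Alg. 1 can be sketched in Python as follows. This is only a bounded illustration under our own assumptions: the names `synthesize`, `update`, `match`, `build`, and `validate` are hypothetical callables supplied by the caller, candidates are opaque values rather than formulas, and the nondeterministic choices "for some n" and "for some m" are resolved by enumerating all pairs of joint operations once (so the sketch terminates, unlike the unbounded loop of Alg. 1).

```python
from itertools import product
from typing import Any, Callable, Iterable, Optional


def synthesize(
    joint_ops: Iterable[Any],
    base_case: Any,
    update: Callable[[Any], Optional[Any]],
    match: Callable[[Any, Any], Any],
    build: Callable[[Any, list, Any], Any],
    validate: Callable[[Any], bool],
) -> Optional[Any]:
    """Bounded variant of lines 4-8 of Alg. 1 over hypothetical callables."""
    ops = list(joint_ops)
    phi_r = []                                    # conjuncts of phi_r, initially `true`
    for op_n, op_m in product(ops, repeat=2):     # resolves "for some n" / "for some m"
        cs_r = update(op_n)                       # line 5: updated state for the recursive call
        if cs_r is None:                          # Update is not applicable to op_n
            continue
        phi_r = phi_r + [match(op_m, cs_r)]       # line 6: phi_r only grows
        candidate = build(base_case, phi_r, cs_r)  # lines 1 and 7: assemble R
        if validate(candidate):                   # line 8: e.g. a theorem-prover call
            return candidate
    return None                                   # all pairs tried without success
```

Note that `op_n` and `op_m` are chosen independently, which mirrors the remark that Update and Match need not be synchronized in pairs.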

#### 3.2 Classifying Operations

Our particular strategies for choosing ingredients for the inductive interpretation of *R* are based on the classification of the operations of the abstract object.

We define a partial ordering "⪯" on ADT states that connects constructors discerned by the recurrence in *R* to the transitions of operations. With respect to this ordering, we can for example recognize operations that leave the ADT unchanged ("noops", which play a special role in Sect. 5), operations that "produce" constructors and thereby enlarge the internal state by additional elements, and conversely operations that "consume" constructors. A natural choice for ⪯ is the reflexive closure of the subterm ordering, where xs ⪯ ys for lists specifies that xs is a suffix of ys. In general, this ordering can be used to control the result of the synthesis for specific applications, and is a heuristic choice. A choice which works well for our examples is that xs is a non-strict subsequence of ys.
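The two orderings on lists mentioned above can be sketched in Python as follows. This is a minimal illustration, assuming that an algebraic list is modeled as a Python list whose head is at position 0, so that cons(y, ys) corresponds to `[y] + ys`; the function names are ours.

```python
def is_suffix(xs, ys):
    """Reflexive subterm ordering on lists: xs is a suffix of ys."""
    return len(xs) <= len(ys) and ys[len(ys) - len(xs):] == xs


def is_subsequence(xs, ys):
    """Weaker choice: xs is a (non-strict) subsequence of ys."""
    remaining = iter(ys)
    # `x in remaining` advances the iterator, so the order of elements is respected.
    return all(x in remaining for x in xs)
```

Every suffix is also a subsequence, so the second ordering relates strictly more pairs of states than the first.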

The ordering naturally extends to tuples of variables (and thus, states), and lets us classify operations into the following three kinds.

Definition 2. *Let* Op *be an operation of an abstract object. Then,*

$$\begin{aligned} \text{isNoop}(Op) & \stackrel{\text{def}}{=} \forall i, s, s', o. \, Op(i, s, s', o) \implies s = s'\\ \text{isProd}(Op) & \stackrel{\text{def}}{=} \forall i, s, s', o. \, Op(i, s, s', o) \implies s \preceq s' \land \neg \text{isNoop}(Op) \\ \text{isConsm}(Op) & \stackrel{\text{def}}{=} \forall i, s, s', o. \, Op(i, s, s', o) \implies s' \preceq s \land \neg \text{isNoop}(Op) \end{aligned}$$

*Example 1.* The class of an operation can often be identified by a cheap syntactic check to recognize when cons is applied to a current state or a next state variable. In the upcoming stack example in Fig. 1, from xs′ = cons(in, xs) we have that push is a producer operation, and from cons(out, xs′) = xs we classify pop as a consumer operation. A top operation, not shown in Fig. 1, would be recognized as a noop (see also hasElement in the upcoming example in Fig. 3).
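The classification can also be approximated by testing, which the following Python sketch illustrates on the list-based stack. It is not the syntactic check described above and it proves nothing: it only evaluates the three conditions on a finite set of sample states, with the suffix ordering as ⪯. The operation encodings (functions returning the next state and the output, or `None` when the precondition fails) and all names are our own assumptions.

```python
def is_suffix(xs, ys):
    """xs is a suffix of ys (lists with the head at position 0)."""
    return len(xs) <= len(ys) and ys[len(ys) - len(xs):] == xs


def push(inp, xs):
    return [inp] + xs, None            # xs' = cons(in, xs)


def pop(inp, xs):
    if not xs:                         # precondition xs != nil fails
        return None
    return xs[1:], xs[0]               # cons(out, xs') = xs


def top(inp, xs):
    if not xs:
        return None
    return xs, xs[0]                   # state is left unchanged


def classify(op, samples):
    """Bounded check of the three classes on sample (input, state) pairs."""
    transitions = [(s, op(i, s)[0]) for i, s in samples if op(i, s) is not None]
    if all(s == t for s, t in transitions):
        return "noop"
    if all(is_suffix(s, t) for s, t in transitions):
        return "producer"
    if all(is_suffix(t, s) for s, t in transitions):
        return "consumer"
    return "unclassified"


SAMPLES = [(i, s) for i in (0, 1) for s in ([], [5], [5, 6])]
```

On these samples, `classify` labels `push` as a producer, `pop` as a consumer, and `top` as a noop, matching Example 1.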

In the next two subsections, we introduce our particular strategies for the implementations of Update and Match of Alg. 1, in reference to Def. 2. Some operations fall into none of the classes, or it may be hard to determine whether they do, given that Def. 2 is semantic; moreover, different operations may contribute different ingredients for a correct definition of *R*. To make use of as many operations as possible, we suggest strategies for all three classes of operations, to be able to synthesize a relational invariant in complex cases, even when complete information about the system is difficult to obtain.

#### 4 Recipe 1: Linear Scan

We identify a class of problems that require *scanning* the arrays in implementations of stacks and queues *linearly*. A distinguishing feature in this class is the presence of a numeric variable in cs through which array cells are accessed (denoted index in the rest of the section). We first illustrate the synthesis process on the following example and then present the algorithmic details.

#### 4.1 Motivating Example

Two realizations of a LIFO stack are shown in Fig. 1: one is based on linked lists, and the other is based on arrays. They share a common interface of initialization and the two operations push and pop. For example, the encodings of pop of ListStack and ArrStack are respectively:

$$\begin{aligned} \operatorname{Op}\_{\mathsf{pop}}^{\mathsf{ListStack}}(xs, xs', out) &= (xs \neq \mathsf{nil} \land xs' = xs.\mathsf{tail} \land out = xs.\mathsf{head}) \\ &= (xs = \mathsf{cons}(out, xs')) \\ \operatorname{Op}\_{\mathsf{pop}}^{\mathsf{ArrStack}}(a, n, a', n', out) &= (n > 0 \land a' = a \land n' = n - 1 \land out = a[n']) \end{aligned}$$

where xs ≠ nil and n > 0 are the preconditions, and out captures the return value. As an illustration, formula Op<sup>ListStack</sup><sub>pop</sub>(s, \_, 7) holds for all states s in which pop terminates and returns 7 (by convention we use \_ to denote terms that are irrelevant in a particular context). Note also that in the implementation of ArrStack, the popped value is not erased from the array: for a[n] to be considered in the future, it has to be overwritten by some push operation. In general, the array always contains infinitely many unknown values outside the range of cells a[0],...,a[n − 1], which are never accessed.
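To make the two encodings concrete, the following Python sketch evaluates them as Boolean predicates over concrete states. It assumes, for illustration only, that lists are Python lists with the head at position 0 and that the array is a tuple long enough for every index that is read; the function names are ours.

```python
def op_pop_list_stack(xs, xs_next, out):
    """Op_pop for ListStack: holds iff xs = cons(out, xs')."""
    return len(xs) > 0 and xs_next == xs[1:] and out == xs[0]


def op_pop_arr_stack(a, n, a_next, n_next, out):
    """Op_pop for ArrStack: n > 0, a' = a, n' = n - 1, out = a[n']."""
    return n > 0 and a_next == a and n_next == n - 1 and out == a[n_next]
```

For instance, `op_pop_arr_stack((3, 7, 9), 2, (3, 7, 9), 1, 7)` holds: the popped value 7 stays in cell 1 of the unchanged array, and only n decreases.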

A possible relational invariant *R*(xs, n, a) bridging ListStack and ArrStack is defined as follows:

$$\mathcal{R}(xs, n, a) = \begin{cases} n = 0 & \text{if } xs = \text{nil} \\ n > 0 \land y = a[n-1] \land \mathcal{R}(ys, n-1, a) & \text{if } xs = \text{cons}(y, ys) \end{cases} \tag{2}$$

Intuitively, this *R* captures that a list xs has the same content as the portion of an array a between indexes 0 (inclusive) and n (exclusive). When xs is empty, the portion of a should be empty too, thus n = 0. Otherwise, xs is created by cons-ing an element y onto some other list ys, in which case (1) n should be strictly positive, and (2) y should belong to the designated portion of a.
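Eq. (2) can be read directly as a recursive function. The sketch below evaluates it on concrete states, assuming that xs is a Python list with its head at position 0 and that a is a list with at least n cells; it is an executable reading of the definition, not part of the synthesis procedure.

```python
def R(xs, n, a):
    """Eq. (2): xs matches a[0..n-1], with the head of xs stored at a[n-1]."""
    if not xs:                          # case xs = nil
        return n == 0
    y, ys = xs[0], xs[1:]               # case xs = cons(y, ys)
    return n > 0 and y == a[n - 1] and R(ys, n - 1, a)
```

For example, `R([7, 3], 2, [3, 7, 9])` holds, while `R([7], 2, [3, 7, 9])` does not, because the recursion reaches nil with n = 1 instead of n = 0. The cell a[2] = 9 lies outside the designated portion and is never inspected.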

```
class ListStack:
  def init():
    xs = nil
  def push(in):
    xs = cons(in, xs)
  def pop():
    assert xs != nil
    out = xs.head
    xs = xs.tail
    return out

class ArrStack:
  def init():
    n = 0
    a = [...]
  def push(in):
    a[n] = in
    n = n + 1
  def pop():
    assert n > 0
    n = n - 1
    return a[n]
```
Fig. 1: Two implementations of a LIFO stack.
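For experimentation, the pseudocode of Fig. 1 can be turned into runnable Python roughly as follows. This is a sketch under our own assumptions: the unbounded array is replaced by a fixed-capacity list, the input is called `value` because `in` is a Python keyword, and the helper `lockstep` (a hypothetical name) merely tests applicability and output equality on random call sequences, which is much weaker than the proof the paper is after.

```python
import random


class ListStack:
    def __init__(self):
        self.xs = []                        # nil; the head of the list is xs[0]

    def push(self, value):
        self.xs = [value] + self.xs         # xs = cons(in, xs)

    def pop(self):
        assert self.xs, "precondition: xs != nil"
        out, self.xs = self.xs[0], self.xs[1:]
        return out


class ArrStack:
    def __init__(self, capacity=64):
        self.n = 0
        self.a = [None] * capacity          # finite stand-in for the unbounded array

    def push(self, value):
        self.a[self.n] = value
        self.n += 1

    def pop(self):
        assert self.n > 0, "precondition: n > 0"
        self.n -= 1
        return self.a[self.n]               # the popped cell is not erased


def lockstep(steps, seed=0):
    """Run both stacks on the same random calls; compare preconditions and outputs."""
    rng = random.Random(seed)
    A, C = ListStack(), ArrStack(capacity=steps)
    for _ in range(steps):
        if rng.random() < 0.5:
            v = rng.randrange(100)
            A.push(v)
            C.push(v)
        else:
            if bool(A.xs) != (C.n > 0):         # applicability must coincide
                return False
            if A.xs and A.pop() != C.pop():     # outputs must be equal
                return False
    return True
```

A call such as `lockstep(200)` exercises both safety constraints of Def. 1 along one random trace.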

Fig. 2: Transitions of consumer operations (left) and producer operations (right) used to instantiate Eq. (1).

The schema in Sect. 3.1 has two placeholders for constraints, φ<sub>b</sub> in the base case and φ<sub>r</sub> in the recursive case, that may refer to the variables in scope (as indicated by their respective parameter lists). Moreover, we seek a state cs<sub>r</sub> in the recursive position. Placeholder φ<sub>b</sub> is instantiated by constraints from the initialization operations, such as n = 0 from ArrStack. This alignment of base case and initialization is not just a coincidence: many data structures start initially empty and are gradually populated by calling operations (e.g., collections).

The purpose of φ<sub>r</sub> in the recursive case of Eq. (1) is twofold. First, it connects a portion of the ADT state (specifically y) to the array state cs, in the example via a[n − 1] = y. Second, it determines a suitable array state cs<sub>r</sub> as an argument of the recursive occurrence of *R*. For instance, we take n − 1 for the recursive call but leave a unchanged. This is motivated by the observation that a state where xs = cons(y, ys) for some y, ys is *consumed* by pop. Using this information, the recurrence of *R* must align with the corresponding array transitions, too, as shown in Fig. 2 on the left. The constraint n > 0 is the precondition of the array operation, whereas y = a[n − 1] follows from comparing the outputs. As shown in Fig. 2 on the right, we can dually base the recurrence on push, which *produces* a cons, i.e., a transition from ys to xs = cons(y, ys) for some y. In this case, both transitions need to be viewed *in reverse* such that the respective successor states of push now match the left side *R*(xs, cs) of the schema. Then, the assignment n = n + 1 can be rewritten to yield the equation n<sub>r</sub> = n − 1.

#### Algorithm 2: Update (recipe 1)

**Input:** Operations Op<sup>A</sup> and Op<sup>C</sup>, as[xs := cons(y, ys)] the shape of the state of A, cs the state variables of C, assuming cs = (\_, index, a) where index and a are variables of integer and array types, resp.

**Output:** Updated arguments cs<sub>r</sub>

1. **if** isProd(Op<sup>A</sup>) **then**
2. &emsp;**let** cs<sub>r</sub> = (\_, index′, a′) be s.t. ∀in, ∃out. Op<sup>C</sup>(in, cs<sub>r</sub>, cs, out);
3. &emsp;**return** (\_, index′, a);
4. **if** isConsm(Op<sup>A</sup>) **then**
5. &emsp;**let** cs<sub>r</sub> = (\_, index′, a′) be s.t. ∀in, ∃out. Op<sup>C</sup>(in, cs, cs<sub>r</sub>, out);
6. &emsp;**return** (\_, index′, a);

#### Algorithm 3: Match (recipe 1)

**Input:** Operations Op<sup>A</sup> and Op<sup>C</sup>, as[xs := cons(y, ys)] the shape of the state of A, cs the state variables of C, cs<sub>r</sub> the updated state of C, assuming cs<sub>r</sub> = (\_, index′, a) where index′ and a are of integer and array types, resp.

**Output:** Formula φ<sub>r</sub>

1. **if** isProd(Op<sup>A</sup>) **then**
2. &emsp;inv ← GetLoopInvariant(index, Op<sup>C</sup>);
3. &emsp;**return** inv ∧ ¬Init<sup>C</sup>(cs) ∧ y = a[index′];
4. **if** isConsm(Op<sup>A</sup>) **then**
5. &emsp;**return** pre<sup>A</sup><sub>n</sub> ∧ pre<sup>C</sup><sub>n</sub> ∧ y = a[index′];
6. **return** true;

To make this intuition practical, our approach suggests a particular strategy for picking operations to take constraints from, recognizing consumers and producers more generally, and validating the guessed relational invariants using induction and lemmas.

#### 4.2 Algorithm Description

Alg. 2 and Alg. 3 show the implementations of Update and Match, respectively, that suit stacks and queues. Recall that these algorithms are called from Alg. 1 and take as input pairs of nondeterministically chosen joint operations of A and C; state variables cs of C; the current version of state variables cs<sub>r</sub> to be used in the recursive call of *R*; and fresh variables y and ys introduced in Alg. 1 to define the inductive rule of *R*. Outputs of Update and Match are, respectively, an updated tuple of variables cs<sub>r</sub> and a subformula ψ to be conjoined with the inductive definition of *R*.

If the producing operator is picked (line 1 of Alg. 2), then we have to find a term index′ such that Op<sup>C</sup> transitions it to index. In particular, after assigning a new value to an array cell, index is monotonically updated (i.e., incremented as in the example in Fig. 1, or decremented). Thus, to access the array cell containing the new value using the updated value of index, we have to invert the arithmetic operation and obtain index − 1 (for Fig. 1) or index + 1 (in the case of decrementation). Technically, Alg. 2 realizes this by taking the index variable from cs, through which cells of the array can be observed (e.g., n in the example in Fig. 1), and finding a term index′ that Op<sup>C</sup> transitions to index. Thus, the resulting cs<sub>r</sub> is composed of the same ingredients as cs, with index′ replacing index.

If the consuming operation is picked (line 4), then we proceed in the reverse direction and find the term index′ that results from transitioning index through Op<sup>C</sup>.

Alg. 3 for this recipe relies on the output of Alg. 2. Interestingly, this remains supported even if cs<sup>r</sup> is computed using the producer but ψ in Alg. 3 is computed using the consumer. Our particular strategy for the consumers in this recipe is 1) to use the precondition for Op<sup>C</sup>, and 2) to bridge the outputs of Op<sup>A</sup> and Op<sup>C</sup> via an equality. In comparison, the inference via the producer in line 1 misses an important constraint in the example, as the precondition of push is trivial. Such a situation can be mitigated by discovering a loop invariant over index (line 2), usually expressible in Linear Integer Arithmetic (LIA), adding it to the inductive case of the interpretation of *R* being synthesized, and blocking the initial state there (to distinguish the inductive case from the base case of the definition of *R*). Loop invariants are generated as interpretations of a predicate inv satisfying the following two implications:

$$\begin{aligned} Init^C(cs) &\implies inv(cs) \\ inv(cs) \land \left(\bigvee_{n \in N} Op_n^C(in, cs, cs', out)\right) &\implies inv(cs') \end{aligned}$$

Note that these CHCs (over LIA) can be solved by numerous existing approaches. Since there is no query, *the strongest* loop invariant would ideally be desirable; in practice, however, it suffices to apply lightweight techniques based on forward propagation of initial states using quantifier elimination, followed by the computation of an inductive subset [20]. This often finds an *adequately strong* invariant.
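To make the two implications concrete, the following sketch checks candidate invariants by exhaustive enumeration over a small bounded state space instead of CHC solving. The counter system (initially n = 0, push increments n, pop decrements n when n > 0), the bound, and all function names are assumptions made for illustration and are not taken from the tool.

```python
from typing import Callable, Iterable, List

def init(n: int) -> bool:
    """Assumed initial condition Init^C: the counter starts at zero."""
    return n == 0

def successors(n: int) -> List[int]:
    """States reachable in one step (the disjunction over all operations)."""
    succ = [n + 1]            # push: the index is incremented
    if n > 0:
        succ.append(n - 1)    # pop: only enabled on a non-empty stack
    return succ

def is_inductive(inv: Callable[[int], bool], states: Iterable[int]) -> bool:
    """Check both implications for inv on the given finite set of states."""
    states = list(states)
    initiation = all(inv(n) for n in states if init(n))
    consecution = all(inv(m) for n in states if inv(n) for m in successors(n))
    return initiation and consecution

BOUND = range(-5, 50)  # hypothetical bound; a solver reasons over all integers
nonnegative_ok = is_inductive(lambda n: n >= 0, BOUND)
capped_ok = is_inductive(lambda n: n <= 10, BOUND)
print(nonnegative_ok, capped_ok)
```

On this toy system, the candidate n ≥ 0 passes both checks, whereas n ≤ 10 fails consecution because push leads from 10 to 11.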

*Example 2.* Recall the stack example in Fig. 1. Alg. 2 computes the index term n − 1 by inverting the increment operation in push. This term is used as an argument of the nested call to *R* in the inductive case of the definition of *R*. By construction, the cell a[n − 1] contains the value of in, i.e., the argument of push. At the same time, in is the argument of cons in Op<sup>A</sup> representing push, which lets us bridge the array and the ADT in the proof. To this end, Alg. 3 takes the argument y of cons from the inductive definition of *R* and equates it with the array cell at the computed index, producing y = a[n − 1]. Combining it all together, we get the final solution, as shown in (2).
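The shape of such a relation can be sketched as an executable check. The encoding below is an illustration under stated assumptions: lists are encoded as nested pairs, the base case is assumed to relate the empty list to n = 0, the inductive case is assumed to additionally require n > 0, and `push` and `pop` are hypothetical joint operations in the style of Fig. 1.

```python
from typing import List, Optional

ConsList = Optional[tuple]  # None encodes nil; (y, ys) encodes cons(y, ys)

def relates(xs: ConsList, n: int, a: List[int]) -> bool:
    """Candidate relation between the list xs and the array a with counter n."""
    if xs is None:
        return n == 0  # assumed base case: empty list, counter at zero
    y, ys = xs
    # Inductive case: the head equals a[n - 1]; recurse with the inverted index.
    return 0 < n <= len(a) and y == a[n - 1] and relates(ys, n - 1, a)

def push(xs: ConsList, n: int, a: List[int], value: int):
    """Joint producer: cons on the list, store at a[n] and increment n."""
    return (value, xs), n + 1, a[:n] + [value] + a[n + 1:]

def pop(xs: ConsList, n: int, a: List[int]):
    """Joint consumer: drop the head, decrement n, leave the stale cell."""
    _, ys = xs
    return ys, n - 1, a

xs, n, a = None, 0, []
xs, n, a = push(xs, n, a, 7)
xs, n, a = push(xs, n, a, 9)
ok_after_push = relates(xs, n, a)
xs, n, a = pop(xs, n, a)
ok_after_pop = relates(xs, n, a)
mismatch = relates((7, None), 1, [8])
print(ok_after_push, ok_after_pop, mismatch)
```

The check only inspects cells below n, which is why the stale value left behind by `pop` does not affect the relation.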

```
class ListSet:
  def init():
    xs = nil
  def hasElement(in):
    return contains(xs, in)
  def insert(in):
    xs = cons(in, xs)
  def erase(in):
    xs = removeall(xs, in)

class ArraySet:
  def init():
    a = [false, false, ...]
  def hasElement(in):
    return a[in]
  def insert(in):
    a[in] = true
  def erase(in):
    a[in] = false
```
Fig. 3: Two implementations of a set, where the list is not necessarily duplicate-free.

#### 5 Recipe 2: Noop-based synthesis

In this section we present a recipe that suits sets, multisets, and maps, which are in some sense *non-linear*. That is, these data structures do not maintain any index variable that would usually be used to access elements. Instead, arrays are viewed as maps, and the corresponding ADTs are equipped with recursive functions that traverse the data structure over and over again for each input. Oftentimes, these objects have noop operations, and our synthesis procedure makes use of them.

#### 5.1 Motivating Example

Fig. 3 shows two implementations of a set. The list-based implementation stores elements in the order of their insertions. The elements are not removed unless erase is called explicitly. Thus, duplicate entries of the same elements are allowed. The implementation uses the recursive contains and removeall functions that both traverse the list and search for a specific element:

$$\mathsf{contains}(xs, a) = \begin{cases} false & \text{if } xs = \mathtt{nil} \\ (a = y) \vee \mathsf{contains}(ys, a), & \text{if } xs = \mathtt{cons}(y, ys) \end{cases}$$

$$\mathsf{removeall}(xs, a) = \begin{cases} \mathtt{nil} & \text{if } xs = \mathtt{nil} \\ \mathit{ite}(a = y, \mathsf{removeall}(ys, a), \mathtt{cons}(y, \mathsf{removeall}(ys, a))) & \text{if } xs = \mathtt{cons}(y, ys) \end{cases}$$
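The two recursive definitions translate directly into executable form. The sketch below assumes an encoding of lists as nested pairs (`None` for nil, `(y, ys)` for cons(y, ys)); the encoding and the sample list are illustrative choices.

```python
from typing import Optional

ConsList = Optional[tuple]  # None encodes nil; (y, ys) encodes cons(y, ys)

def contains(xs: ConsList, a: int) -> bool:
    """True iff a occurs somewhere in xs."""
    if xs is None:
        return False
    y, ys = xs
    return a == y or contains(ys, a)

def removeall(xs: ConsList, a: int) -> ConsList:
    """Copy of xs without any occurrence of a."""
    if xs is None:
        return None
    y, ys = xs
    rest = removeall(ys, a)
    return rest if a == y else (y, rest)

sample = (1, (2, (1, None)))          # cons(1, cons(2, cons(1, nil)))
without_one = removeall(sample, 1)
print(contains(sample, 2), without_one, contains(without_one, 1))
```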

The array-based implementation handles a map a from elements to Booleans. Initially, all cells in a are false. Inserting and removing an element is implemented by storing true and false to the corresponding cell respectively. The difficulty here is to support the shown implementation of insert and erase in Fig. 3, as well as possible variants that e.g., eagerly prune duplicate entries in the list-based implementation (see Sect. 6).
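As an informal sanity check of the claim that both implementations behave alike, one can run them in lockstep on random operations and compare the observable outputs. This is a testing sketch, not a proof: the finite universe size, the use of Python lists for both states, the renaming of the parameter `in` (a Python keyword) to `elem`, and the function `run_lockstep` are all assumptions made for illustration.

```python
import random

UNIVERSE = 8  # hypothetical finite element domain {0, ..., 7}

class ListSet:
    def __init__(self):
        self.xs = []                      # nil
    def has_element(self, elem):
        return elem in self.xs            # plays the role of contains
    def insert(self, elem):
        self.xs.insert(0, elem)           # cons; duplicates are allowed
    def erase(self, elem):
        self.xs = [x for x in self.xs if x != elem]  # removeall

class ArraySet:
    def __init__(self):
        self.a = [False] * UNIVERSE
    def has_element(self, elem):
        return self.a[elem]
    def insert(self, elem):
        self.a[elem] = True
    def erase(self, elem):
        self.a[elem] = False

def run_lockstep(steps: int, seed: int = 0) -> bool:
    """Apply the same random operations to both sets and compare outputs."""
    rng = random.Random(seed)
    abstract, concrete = ListSet(), ArraySet()
    for _ in range(steps):
        op = rng.choice(["insert", "erase"])
        elem = rng.randrange(UNIVERSE)
        getattr(abstract, op)(elem)
        getattr(concrete, op)(elem)
        if any(abstract.has_element(z) != concrete.has_element(z)
               for z in range(UNIVERSE)):
            return False
    return True

agree = run_lockstep(200)
print(agree)
```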

The expected output of our synthesis procedure is as follows:

$$\mathcal{R}(xs, a) = \begin{cases} \forall z. \ \neg a[z] & \text{if } xs = \mathtt{nil} \\ a[y] \land \mathcal{R}(ys, a[y:= \mathtt{contains}(ys, y)]), & \text{if } xs = \mathtt{cons}(y, ys) \end{cases} \tag{3}$$
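A relation of this shape can be evaluated directly on small instances. The sketch below is an illustration under assumptions: lists are nested pairs, the array is a tuple of Booleans over a small finite universe (so the quantifier in the base case becomes a finite check), and the updated cell in the inductive case is read as the list head y.

```python
from typing import Optional, Tuple

ConsList = Optional[tuple]  # None encodes nil; (y, ys) encodes cons(y, ys)

def contains(xs: ConsList, v: int) -> bool:
    if xs is None:
        return False
    y, ys = xs
    return v == y or contains(ys, v)

def store(a: Tuple[bool, ...], i: int, b: bool) -> Tuple[bool, ...]:
    """Functional array update a[i := b]."""
    return a[:i] + (b,) + a[i + 1:]

def relates(xs: ConsList, a: Tuple[bool, ...]) -> bool:
    if xs is None:
        return not any(a)                 # base case: every cell is false
    y, ys = xs
    # Inductive case: the head is present in a; the recursive call sees a
    # with cell y reset to whether y still occurs in the tail.
    return a[y] and relates(ys, store(a, y, contains(ys, y)))

xs = (1, (2, (1, None)))                  # duplicates are allowed
good = relates(xs, (False, True, True, False))
bad = relates(xs, (True, True, True, False))   # cell 0 set, but 0 not in xs
print(good, bad)
```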


#### Algorithm 5: Match (recipe 2)

Input: Operations Op<sup>A</sup> and Op<sup>C</sup> such that isNo(Op<sup>A</sup>) holds, as[xs := cons(y, ys)] the shape of the state of A, denoted as<sup>0</sup> below, cs the state variables of C, cs<sup>r</sup> the updated state of C

Output: Formula φ<sup>r</sup>

1 φ ← Op<sup>A</sup>(y, as<sup>0</sup>, as<sup>0</sup>, out) ∧ Op<sup>C</sup>(y, cs, cs<sup>r</sup>, out);

2 return simplify(QE(∃out . φ));

#### 5.2 Algorithm details

Alg. 4 and Alg. 5 show the implementations of Update and Match, respectively, for this recipe. The arguments cs<sup>r</sup> of the nested call to *R* in the inductive case of the definition of *R* are computed in Alg. 4 using the symbolic encoding of *noop*. In the set example, *noop* is the hasElement operation, which allows observing the status of the internal state and does not modify it. We furthermore assume that the input of Op<sub>n</sub> coincides with the type of elements stored in the list, i.e., it is meaningful to call Op<sub>n</sub>(y, ···) with the list head y from the recursive case of (1) where xs = cons(y, ys).

The key idea behind Alg. 4 is to make the necessary adjustments to cs to construct cs<sup>r</sup> that mirror any changes that can be observed via Op<sup>A</sup> when transitioning from list xs to ys in (1). This update is determined in terms of auxiliary variables cs′ that are constrained to satisfy certain input/output pairs for the corresponding Op<sup>C</sup>, by case analysis on whether the input is the particular y that is removed by the recurrence. The primary intention is to reassign a[y] appropriately. We do this by collecting constraints φ such that the output observed for Op<sup>C</sup> for y and cs′ matches that of the corresponding Op<sup>A</sup> on the smaller state with ys. This is also the key difference to Sect. 4, where we heuristically keep the array a unchanged in the recursive call in (1). The outputs for all other inputs z, however, are enforced to be unchanged w.r.t. the original cs, which is expressed by the constraint ψ. We then eliminate the quantifier for out (which is straightforward as the operations are deterministic) and rewrite the formula into closed expressions cs<sup>r</sup> for the variables cs′ as the result.

*Example 3.* Specifically for the example in Sect. 5.1, the algorithm proceeds by symbolic execution of hasElement, yielding the following constituent formulas:

$$\begin{aligned} Op^A &= (out = \mathsf{contains}(ys, y)) \\ Op^C &= (out = a[y]) \\ \phi &= (out = \mathsf{contains}(ys, y) \land out = a'[y]) \\ \psi &= (\forall z. \, y \neq z \implies \exists out'. \, out' = a'[z] \land out' = a[z]) \end{aligned}$$

The result ∃out . φ ∧ ψ of Alg. 4 is now solved for a′. The only free variables refer to the states of the systems. The bound variables out and out′ can be eliminated by merging the equalities over out and out′:

$$a'[y] = \mathsf{contains}(ys, y) \land (\forall z . y \neq z \implies a'[z] = a[z])$$

The first conjunct therefore provides the update for a′[y], whereas the second conjunct states that a′[z] should *not* be changed at indices other than y. After applying the axioms of the theory of arrays we obtain the following equality, which pattern matches the expected shape in line 4:

$$\text{QE}(\exists out . \phi) \iff (a' = a[y := \mathsf{contains}(ys, y)])$$

This transformation requires "reverse-applying" the axiom of extensionality, i.e., switching from the pointwise comparison of a and a′ to an equality between the entire arrays. Note that while quantifier elimination is difficult in general, our current implementation has limited, but often sufficient, support that can be extended by supplying rules to the underlying SMT-based theorem prover.
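The step from a pointwise description to an equality between whole arrays can be checked exhaustively on a small finite universe. The sketch below is only such a bounded check, not the rewriting performed by the prover; the tuple encoding of arrays and the names `store`, `pointwise`, and `equivalent_on_universe` are assumptions made for illustration.

```python
from itertools import product
from typing import Tuple

Array = Tuple[bool, ...]

def store(a: Array, y: int, b: bool) -> Array:
    """Functional array update a[y := b]."""
    return a[:y] + (b,) + a[y + 1:]

def pointwise(a: Array, a2: Array, y: int, b: bool) -> bool:
    """a2[y] = b, and a2 agrees with a at every index other than y."""
    return a2[y] == b and all(a2[z] == a[z] for z in range(len(a)) if z != y)

def equivalent_on_universe(size: int) -> bool:
    """Compare the pointwise form with a2 == store(a, y, b) for all inputs."""
    arrays = list(product([False, True], repeat=size))
    return all(pointwise(a, a2, y, b) == (a2 == store(a, y, b))
               for a in arrays for a2 in arrays
               for y in range(size) for b in (False, True))

holds = equivalent_on_universe(3)
print(holds)
```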

While Alg. 4 predicts *future* outputs of Op<sup>A</sup> for input y, Alg. 5 executes Op<sup>A</sup> on the state where xs = cons(y, ys) to obtain the *current* output of Op<sup>A</sup> for the same y. The generated constraint simply expresses that the output of Op<sup>C</sup> has to match. For hasElement we obtain the following formula:

$$\exists out \, . \left( \mathsf{contains}(\mathsf{cons}(y, ys), y) = out \right) \land \left( a[y] = out \right)$$

Unfolding the definition of contains and simplifying produces true = a[y], which is then used as the "*body*" of the inductive case of *R* in (3).

#### 6 Evaluation

We have implemented the approach in a prototype CHC solver called AdtChc<sup>3</sup>, relying on AdtInd [55] as an inductive prover, which in turn uses the Z3 [40] SMT solver to quickly perform the satisfiability checks over uninterpreted functions and linear arithmetic that are needed at various solving stages. AdtChc automatically determines the appropriate synthesis recipe through analyzing the syntax of the program (i.e., presence of index variables) and is able to successfully find relational invariants and prove them valid for all considered benchmarks.

<sup>3</sup> The tool and benchmarks are available at https://github.com/grigoryfedyukovich/aeval/tree/adt-chc.

We have evaluated the approach from Sect. 3 on different realizations of text-book data structures. The evaluation aims at answering two questions. Is the approach effective in the first place to discover suitable relational invariants, and how well can the necessary induction proofs be automated? The latter is relevant since Alg. 1 crucially depends on Validate in its refinement loop.

All our benchmarks require recursive invariants. They fall into two categories. First, stacks and queues from Sect. 4 (with variations that store values only to even indexes of the array) are solved based on linear scan. Second, sets, multisets, and maps (that differ in whether, e.g., duplicate elements are stored in the respective lists) are solved with the approach in Sect. 5. We include such variations to reflect different trade-offs when designing specifications, and to demonstrate that our technique is reasonably flexible. The only user-provided lemma was required for the multiset benchmark (marked <sup>∗</sup> in Table 1): ∀ a, xs. num(a, xs) = 0 =⇒ remove(a, xs) = xs.

The results from the evaluation<sup>4</sup> of both groups of benchmarks (resp., recipes used) are shown in Table 1. The choice which recipe to use was made by the tool itself at synthesis time. Total time (in seconds wall-clock) is entirely dominated by proof search in AdtInd, and includes the time for SMT queries. We remark that the time to synthesize the relational invariant is negligible in comparison to the proof time (and the proof time is often proportional to the number of internal SMT calls).

Table 1: Invariant synthesis timings.


Most proofs are found using the default proof strategy (the same for every benchmark) within 20s. This is caused by the large proof search space created by a combination of array simplification and forward rewriting. We have also tested our tool on buggy implementations, e.g., in which the consumer operations are correct (and can be used for correct guesses of relational invariants), but producers are not. As expected, the tool is unable to synthesize a relational invariant for the whole system in these cases.

We have already presented the relational invariant found for the stack in (2). For the stack variant that stores to even array indices only, the counter n is decreased by 2 instead of 1 in the recursive call, as expected. The relational invariant *R*(xs, m, n, a) for the queue benchmarks keeps two indices into the array a; depending on the variant, the first element of the list xs is found at a[m] or a[n], and the recursion either increases m or decreases n. The relational invariants for the multiset and map examples are analogous. All necessary lemmas are automatically discovered and proved by AdtInd; an example for the set benchmarks is ∀ xs, s, x. *R*(xs, s) =⇒ contains(x, xs) = s[x].

<sup>4</sup> The evaluation was conducted on MacBook Pro, Processor: 2 GHz Intel Core i5, Memory: 8 GB 1867 MHz LPDDR3, MacOS v10.14.6.

#### 7 Related Work

Although there exist automated techniques to synthesize relational invariants, nothing was proposed to deal simultaneously with ADTs and arrays. Conceptually, our approach is related to SimAbs, an SMT-based algorithm for simulation synthesis [18]. SimAbs explores a space of possible simulations and (dis-)proves them using an off-the-shelf decision procedure. Guesses for simulation relations are also obtained from the source code, by matching variables from two programs. Alternatively, simulation relations can be inferred from test runs [49] or through translation validation [41]. Our approach allows dealing with objects (not just imperative code) and contributes several novel strategies for guessing and proving non-trivial simulation relations.

Discovery of invariants to relate the behaviors of two programs or other ways of establishing program equivalence is an active research area [5,14,22,23,39,44, 51]. These approaches typically reduce the relational verification problem to a safety verification problem and rely on existing tools, often solvers for constrained Horn clauses (CHC). Currently, since ADTs and arrays are challenging for the underlying solvers, the applicability of these approaches to our tasks is also limited. There are decision procedures for abstraction of ADTs to lists, sets, and multisets [52]; however, these apply to certain predefined abstractions only.

Our approach can be seen as an application of Syntax-Guided Synthesis (SyGuS) [2]. Strategies dependent on types of benchmarks essentially represent sets of syntactic templates filled iteratively and checked using an SMT solver. SyGuS is successfully used also in CHC solving [19,21] and in lemma synthesis [46,47,55]. There are only a few approaches [21, 28, 31, 55] that apply SyGuS to synthesize formulas over ADTs or arrays/quantifiers. Data-driven approaches are complementary to such syntax-based approaches, e.g., [38]. Neither deals with arrays, quantifiers, and ADTs at the same time.

Unno et al. [53] support recursive predicates, by taking the least solution of initialization and consecution as the definition of *R*, however, this may lead to rather cumbersome inductive cases (e.g., for pop in the stack). We avoid the problem by basing the recurrence scheme on the data structure, and infer constraints that are well aligned to that scheme from the operations. Jennisys [34] tackles the related problem of generating recursive implementations from an abstract model, where the simulation relation is given.

More generally, the problem addressed in this work relates to the idea of step-wise refinement, originally conceived by [16] and [54] as a guideline to organize software development and later studied extensively in a formal setting for rigorous assessment of functional correctness (e.g., [1, 4, 15, 25, 29, 33, 36]). The standard proof technique relies on simulation relations [37] that couple the two state spaces, which is directly reflected in the CHC system of Def. 1.

Many methods and tools support development using formal refinement [1,4, 8,17,26,29,33,45]. Large-scale verification projects that are based on refinement include seL4 [30], FSCQ [10], Flashix [48], and CompCert [35], with high human effort involved. Correct-by-construction correspondence between low-level code and high-level data types helps to some extent in, e.g., [13] and Cogent [3]. Recent work on "push-button" verification includes a verified TLS library [12], AWS C Common library [11], file system [50], a hyperkernel [42], network functions [56], where the high degree of proof automation is in part achieved by statically bounding the state space of the systems. The latter work [56] specifically notes how non-experts can formulate high-level correctness requirements (their specifications are written in Python), as evidence that refinement-based approaches may ultimately overcome the "specification bottleneck" [6, 43].

#### 8 Conclusion and Outlook

We have demonstrated an approach that can fully automatically synthesize and prove relational invariants over recursive data types and arrays. The approach is based on introducing quantifiers and recursion into the definition of such relations in a systematic way, and by instantiating this schema with constraints from joint transitions of the two systems. A somewhat surprising insight was that it is useful to view such transitions both forward and in reverse, leading to the classification into producers and consumers as a guideline for the search.

We have presented a general synthesis algorithm and two concrete instantiations for different data structures of different sorts. The approach is fully automatic in guessing a relation and proving it correct. It relies on the recently developed CHC solver called AdtChc which in turn is based on an SMT-based theorem prover AdtInd featuring a support for arrays, quantifiers and structural induction. The approach is modular and can be extended by further synthesis strategies in the future. In particular, since based on CHC techniques, it can be integrated with other existing CHC solvers tailored to non-ADT reasoning, and can be used in large-scale verification frameworks such as [24] that reduce the safety verification to CHC tasks.

Many more interesting benchmarks lend themselves for further investigation: positional insertion and removal of lists, amortized data structures, benchmarks based on trees or nested arrays, and ultimately some real-world software systems. With a growing search space, it becomes more important to quickly recognize incorrect simulation relations, e.g., by evaluation-based counter-examples (cf. [31]), to prevent costly proof attempts. Similarly, incorporating external tools for invariant generation is another topic for future work.

#### References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

### A Two-Phase Approach for Conditional Floating-Point Verification

Debasmita Lohar<sup>1</sup>, Clothilde Jeangoudoux<sup>1</sup>, Joshua Sobel<sup>2</sup>, Eva Darulova<sup>1</sup>, and Maria Christakis<sup>1</sup>

<sup>1</sup> MPI-SWS, Saarland Informatics Campus, Saarbrücken and Kaiserslautern, Germany, {dlohar,jeangoudoux,eva,maria}@mpi-sws.org

<sup>2</sup> University of Rochester, Rochester, USA, jsobel3@u.rochester.edu

Abstract. Tools that automatically prove the absence or detect the presence of large floating-point roundoff errors or the special values NaN and Infinity greatly help developers to reason about the unintuitive nature of floating-point arithmetic. We show that state-of-the-art tools, however, support or provide non-trivial results only for relatively short programs. We propose a framework for combining different static and dynamic analyses that increases their reach beyond what they can do individually. Furthermore, we show how adaptations of existing dynamic and static techniques effectively trade some soundness guarantees for increased scalability, providing conditional verification of floating-point kernels in realistic programs.

#### 1 Introduction

Floating-point arithmetic is widely used across many domains, including machine learning, scientific computing, embedded systems, and the Internet of Things. Floating-point computations resemble real-valued arithmetic, but provide only finite precision, which commits roundoff errors at potentially every operation. While these errors are individually small, they propagate through an application and can make its results meaningless [47]. In addition, floating-point arithmetic features special values such as not-a-number (NaN) and Infinity [48]. As a result, these computations are very challenging for developers to reason about and debug manually. There is, therefore, a clear need for automated verification and debugging techniques for such computations.

Unfortunately, today's techniques do not handle realistic floating-point programs well. Consider for example a program that simulates the interaction of several bodies under gravity. We took a C implementation of this N-body problem from Rosetta Code [5], which takes as input the masses, positions and velocities of—in our case—three bodies, and shows their evolution over a number of timesteps. The entire program is moderately-sized with 108 lines of code. Suppose that we want to verify the absence or presence of special floating values and cancellation (i.e. large roundoff) errors in this program. None of the currently available floating-point analysis tools is able to do this.

© The Author(s) 2021. J. F. Groote and K. G. Larsen (Eds.): TACAS 2021, LNCS 12652, pp. 43–63, 2021. https://doi.org/10.1007/978-3-030-72013-1_3

```
1 int main(int argc, char* argv[]) {... // Reads masses, positions and velocities
2 for(int i=0; i<timeSteps; i++) { simulate(mass, pos, v); ...}
3 }
4 void simulate() { compute_accelerations(mass, pos); ...}
5 void compute_accelerations(double mass[], vector pos[]){
6 for(int i=0;i<bodies;i++){ ...
7 for(int j=0;j<bodies;j++) {if(i!=j) {
8 acc[i] = numerical_kernel(mass[j], pos[i], pos[j], acc[i]);}}}}
9 vector numerical_kernel(double mass, vector pos_i, vector pos_j, vector acc) {
10 return addVectors(acc, scaleVector(g*mass/pow(mod(subtractVectors(pos_i,pos_j)),3),
          subtractVectors(pos_j,pos_i))); // compute acceleration
11 }
```
Listing 1.1. Snippet of Rosetta code N-body simulation

State-of-the-art static roundoff-error analysis tools [33,31,30,60,65,72] are in principle capable of proving the absence of both special values and large roundoff errors by computing an abstraction of the possible behaviors. However, they work only on small programs, mostly consisting of a single function, and thus do not work for our N-body example. The static tools that do scale [11,63,43] suffer from large over-approximations due to abstractions and thus effectively cannot prove the absence of issues either. Bounded model checking [52] or SMT decision procedures [25] perform exact bit-precise reasoning, but do not scale enough due to the complexity of floating-point arithmetic.

On the other hand, there exist dynamic analyses that search for concrete inputs proving the presence of Infinities [38], NaNs or cancellation errors [10,21,78]. We could not apply any of these tools on our example, to a large part because they, too, have been designed for relatively small programs. More guided techniques such as symbolic execution [57] rely on a back-end SMT solver, for which floating-point theories have very limited scalability.

We evaluated representative available tools on a new collection of floating-point benchmarks and obtained similar results for most of them (Section 5).

We observed that often only a relatively small part of a program performs complex numerical computations—we call these parts the *numerical kernels*. Existing state-of-the-art floating-point analyzers can be applied to these kernels, provided that one can supply a precondition that bounds the kernel's input ranges (their minimum and maximum values). Obtaining such preconditions manually is challenging, since the kernels are usually nested in loops as functions. Listing 1.1 shows a subset of the N-body example; the numerical kernel that we identified is on line 9, nested behind several for-loops and function calls.

Based on this observation, we propose a two-phase analysis that combines different program analyses to conditionally verify the absence of special values and cancellation errors in numerical kernels 'concealed' in large programs. First, we employ a scalable program analysis to infer the ranges of a kernel's inputs in the context of the containing application. In the second phase a different program analysis assumes these ranges to verify the kernels.

The main insight behind this combination is that the first scalable analysis does not need to perform sophisticated floating-point reasoning; the domain specifications required for the second numerical analysis need to only capture input ranges of variables.

The main challenge in our two-phase analysis is the first phase where our objective is to infer the ranges of the kernel inputs automatically. We first attempt to verify the numerical kernels fully soundly. Hence, we utilize abstract interpretation to infer sound ranges of kernel inputs. In case it is unable to infer useful (finite) ranges for the kernels, we propose to adapt existing blackbox and greybox fuzzing techniques [12], and evaluate them in their ability to produce large kernel input ranges capturing as many feasible inputs as possible.

After inferring the kernel ranges, the second phase utilizes a slightly adapted existing static and sound roundoff error analysis [30] to verify the kernels. In case this analysis produces warnings for special values, we additionally utilize SMT-based bounded model-checking [52] to check for spurious warnings.

Although there is a large body of work on combining different program analyses, our goal of analyzing real-world applications to verify their numerical kernels is novel. Our combination is specifically tailored to this setting, by considering the intricacies of floating-point arithmetic and the limitations of today's analysis techniques in reasoning about them.

Using a dynamic analysis in the first phase means that we are only able to infer approximations of the kernel input ranges. Consequently, we can verify the kernels only *conditionally*, because the verification is performed under the assumption that the input-domain specifications precisely describe possible values of the kernel inputs. Thus, we take a practical standpoint and relax the soundness guarantees in favor of wider applicability of today's static floating-point roundoff-error verification techniques.

Our evaluation shows that for 16 out of 24 kernels, this approach is able to verify that no special floating-point values occur; for 3 of those kernels, verification is sound. For 14 kernels, we additionally show the absence of cancellation errors that are a main cause of large roundoff errors.

*Contributions* To summarize, our paper makes the following contributions:


Our benchmarks, the tool Blossom as well as scripts of all of our experiments are available at https://github.com/dlohar/blossom.

Fig. 1. Overview of our approach

#### 2 A Two-Phase Approach

Figure 1 shows an overview of our two-phase approach that strives to increase the reach of existing floating-point analyses of floating-point numerical kernels. Our key observation is that such kernels appear in real-world applications from a variety of domains, but they are often 'hidden' behind several function calls and other non-numerical code that the round-off analyzers cannot handle. The first phase infers bounds on the input variables of a set of numerical kernels K that have been identified by a user in a program P. In the second phase, we utilize these ranges to (conditionally) verify the kernels, i.e. to (conditionally) prove the absence of special values and large roundoff errors.

An alternative strategy would be to identify the largest kernel input ranges for which correctness can be guaranteed. However, even if one could infer such preconditions (we are not aware of a tool that performs such a backward analysis), our techniques for the first phase would still be needed to determine whether the program can execute the kernels on inputs outside of the safe ranges.

#### 2.1 First Phase: Whole Program Analysis

In the first phase we have a whole program analyzer that, starting from the *program inputs* constrained by I, infers bounds R on the *kernel inputs* automatically. These bounds are crucial, as the presence of cancellations and special values directly depends on the ranges of possible values; an unbounded input range will, in general, also lead to unbounded roundoff errors and special values.

To obtain the kernel ranges, we need to analyze the entire program. In general, it is infeasible to compute the exact ranges, so we approximate them. We propose to first use a sound static analysis, which computes an over-approximation of the true ranges. These ranges thus cover all feasible inputs, but also spurious ones, so we want them to be as tight (small) as possible. If the over-approximations resulting from the abstractions that the static analyzer necessarily performs become prohibitively large, we propose to use dynamic analysis to compute an unsound approximation of the kernel ranges. These ranges should be as wide as possible to capture as many concrete executions as possible.

*Sound Static Analysis* We choose abstract interpretation [26] and specifically the industry-strength analyzer Astrée [63] to infer a sound over-approximation of the kernel ranges, as Astrée scales for large programs with complex code and data structures and comes with a variety of abstract domains.

The choice of the abstract domain in Astrée is, in general, a trade-off between the amount of over-approximation and the analysis running time. The interval domain abstracts a set of concrete variable values by their lower and upper bounds: $[\underline{x}, \overline{x}] := \{x \mid \underline{x} \le x \le \overline{x}\}$. While operations in interval arithmetic [64] are efficient, intervals cannot capture correlations between variables and therefore over-approximate the real behavior (e.g. x − x ≠ 0 in interval arithmetic). Nonetheless, for our benchmarks we have not observed any noticeable difference in the results with more sophisticated domains (e.g. octagon). This is likely due to our benchmarks having many nonlinear operations. Hence, we choose the interval domain as the numerical abstract domain for our purpose.
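The loss of correlation can be seen in a few lines of code. The following is a minimal sketch of interval arithmetic and not the abstract domain of any particular analyzer; in particular, it assumes exact endpoint arithmetic and omits the outward rounding that a sound floating-point implementation needs. The class name `Interval` is an illustrative choice.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other: "Interval") -> "Interval":
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other: "Interval") -> "Interval":
        # The operands are treated as independent, even if they are the
        # same variable, so the lower bound uses other.hi and vice versa.
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other: "Interval") -> "Interval":
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))

x = Interval(1.0, 2.0)
difference = x - x
print(difference)  # Interval(lo=-1.0, hi=1.0), although x - x is always 0
```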

*Dynamic Analysis* Fuzzing finds inputs that demonstrate certain (unwanted) behavior. We propose to fuzz a program and at the same time monitor the kernel inputs to record the lower and upper bounds seen during concrete executions.

We instrument each user-specified kernel in the program with a kernel monitor that keeps track of the smallest and largest value seen for each kernel input. We repeatedly execute the instrumented program and report the minimum and maximum values seen for each kernel input over all executions. This approach crucially depends on the choice of program inputs that are used for fuzzing. We propose and experimentally compare blackbox, guided blackbox, and directed greybox fuzzing [12] as methods for input selection in Section 6.
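A kernel monitor of this kind can be sketched as follows. The class `KernelMonitor`, the toy `kernel`, and the toy `program` are hypothetical stand-ins for illustration; they do not reproduce the instrumentation used in the tool.

```python
import math
from typing import Dict, Iterable, Tuple

class KernelMonitor:
    """Tracks the smallest and largest value seen for each kernel input."""

    def __init__(self, names: Iterable[str]):
        self.ranges: Dict[str, Tuple[float, float]] = {
            name: (math.inf, -math.inf) for name in names}

    def record(self, **inputs: float) -> bool:
        """Update the ranges; return True if any range was widened."""
        widened = False
        for name, value in inputs.items():
            lo, hi = self.ranges[name]
            new_range = (min(lo, value), max(hi, value))
            widened = widened or new_range != (lo, hi)
            self.ranges[name] = new_range
        return widened

monitor = KernelMonitor(["mass", "dist"])

def kernel(mass: float, dist: float) -> float:
    monitor.record(mass=mass, dist=dist)   # instrumentation at kernel entry
    return mass / dist ** 3

def program(scale: float) -> None:
    """Hypothetical surrounding program that calls the kernel in a loop."""
    for i in range(1, 4):
        kernel(mass=scale * i, dist=1.0 + i)

for program_input in (1.0, 2.5):           # repeated executions
    program(program_input)
print(monitor.ranges)
```

After both executions, the monitor reports the range [1.0, 7.5] for `mass` and [2.0, 4.0] for `dist`.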

*Blackbox* fuzzing is a naive but effective technique in many testing situations. In our setting, the blackbox fuzzer randomly draws inputs from the program ranges I, i.e. without any reference to the internal structure of the program.
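The kernel monitor and the blackbox input generator described above might be sketched as follows. This is an illustration under stated assumptions, not the instrumentation used in the paper: `toy_program`, its two kernel inputs, and the fixed number of `runs` (standing in for a timeout) are all hypothetical.

```python
import math
import random


class KernelMonitor:
    """Tracks the smallest and largest value seen for each kernel input."""

    def __init__(self, n_inputs):
        self.lo = [math.inf] * n_inputs
        self.hi = [-math.inf] * n_inputs

    def observe(self, values):
        """Record one kernel invocation; return True if any range widened."""
        widened = False
        for i, v in enumerate(values):
            if v < self.lo[i]:
                self.lo[i] = v
                widened = True
            if v > self.hi[i]:
                self.hi[i] = v
                widened = True
        return widened


def toy_program(a, b, monitor):
    """Hypothetical program: some preprocessing, then a kernel call."""
    u = a * a - b      # kernel input 0
    w = abs(a) + 1.0   # kernel input 1
    monitor.observe([u, w])
    return u / w       # the "kernel"


def blackbox_fuzz(program_ranges, runs, seed=0):
    """Draw program inputs uniformly from their ranges and monitor the kernel."""
    rng = random.Random(seed)
    monitor = KernelMonitor(n_inputs=2)
    for _ in range(runs):
        inputs = [rng.uniform(lo, hi) for lo, hi in program_ranges]
        toy_program(*inputs, monitor)
    return list(zip(monitor.lo, monitor.hi))


ranges = blackbox_fuzz([(-2.0, 2.0), (0.0, 1.0)], runs=10_000)
print(ranges)
```

For this toy program the true kernel input ranges are [-1, 4] and [1, 3]; the observed ranges are always contained in them, which is why the dynamic approximation is unsound but never spurious at its endpoints.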

We further propose *guided blackbox* fuzzing that is guided toward enlarging the kernel input ranges. For this, the program input generator records those inputs that have widened the kernel ranges, and randomly generates new inputs that are within a certain (small) distance from these, in the hope that the new inputs would enlarge the monitored ranges even further.

While blackbox techniques are straightforward to implement, they do not take into account the program structure. We thus evaluate an adaptation of *directed greybox* fuzzing, implemented in the state-of-the-art tool AFLGo [12], which can be directed toward specific program locations while exploring as many different paths in the program as possible. We first fuzz the program to obtain an initial estimate for the kernel input ranges with AFLGo (targeting the kernel). Then, we employ AFLGo in a refinement loop that iteratively attempts to widen the currently seen kernel input ranges. We instrument the kernels with conditional statements that check whether a kernel input is outside of the current kernel range. We use this conditional statement as a target for AFLGo, effectively directing it to find kernel inputs that are outside of the current estimate. If AFLGo finds a program input that widens the current kernel input range, we update it accordingly and iterate the process until a user-defined timeout.
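The shape of this refinement loop can be illustrated without the fuzzer itself. In the sketch below, `find_widening_input` is a plain random search that merely stands in for the directed fuzzer (it does not model AFLGo or its interface), `kernel_inputs` is a hypothetical program prefix, and a fixed number of `rounds` replaces the timeout.

```python
import random


def kernel_inputs(a, b):
    """Hypothetical program prefix that computes the kernel inputs."""
    return [a * a - b, abs(a) + 1.0]


def outside(values, lo, hi):
    """The inserted conditional: true if some kernel input leaves its range."""
    return any(v < low or v > high for v, low, high in zip(values, lo, hi))


def find_widening_input(program_ranges, lo, hi, rng, attempts):
    """Stand-in for the directed fuzzer: random search for the target branch."""
    for _ in range(attempts):
        inputs = [rng.uniform(low, high) for low, high in program_ranges]
        if outside(kernel_inputs(*inputs), lo, hi):
            return inputs
    return None


def refine(program_ranges, initial_input, rounds, attempts=200, seed=1):
    rng = random.Random(seed)
    values = kernel_inputs(*initial_input)
    lo, hi = list(values), list(values)   # initial estimate from one run
    for _ in range(rounds):               # stands in for the timeout
        found = find_widening_input(program_ranges, lo, hi, rng, attempts)
        if found is None:
            continue                      # no widening input in this round
        for i, v in enumerate(kernel_inputs(*found)):
            lo[i] = min(lo[i], v)
            hi[i] = max(hi[i], v)
    return list(zip(lo, hi))


print(refine([(-2.0, 2.0), (0.0, 1.0)], initial_input=(0.0, 0.5), rounds=50))
```

Each round re-targets the guard with the updated bounds, so the search is always directed at values strictly outside the current estimate.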

#### 2.2 Second Phase: Numerical Kernel Analysis

With the ranges (R) inferred in the first phase, we analyze the user-identified numerical kernels (K) in the second phase with a static analyzer. Our objective in the second phase is to either show the absence of special floating-point values and large roundoff errors in a kernel or to generate warnings for the potential presence of such values.

We use the sound floating-point roundoff analysis tool Daisy [30], which automatically proves the absence of special values and computes an absolute error bound for each kernel output. When Daisy generates a warning that special values can potentially occur, we use a SAT/SMT-based model checker that performs exact floating-point reasoning and that can identify spurious warnings.

By itself, the error bound on the *kernel* output is not particularly helpful, however, since we do not know how this error propagates to the *end* of the program (although there exist scalable analyses that potentially can compute this information, e.g. [61]). That said, for many numerical applications the exact error bound is not important, since the algorithm itself is already approximate. For these applications, it is thus sufficient if we can show that the *roundoff errors are not too large*. We thus modify Daisy to report a warning when it detects a possible *cancellation*, i.e. when an arithmetic operation increases the relative error significantly (e.g. when two values that are close in magnitude get subtracted [42]). Additionally, Daisy includes an optimization procedure that can improve the accuracy of the kernels by rewriting the arithmetic expressions to commit smaller roundoff errors. We provide more details in Section 4.

#### 2.3 Soundness Guarantees

To summarize, using the extended Daisy analysis, we can conditionally verify that kernels do not result in any NaN or Infinity, and that they do not commit cancellation errors, i.e. lead to large roundoff errors. When the kernel input ranges are computed soundly using abstract interpretation (e.g. Astrée), our verification is conditional in that we only verify the absence of cancellations for the kernels, but not for the rest of the program.

When the ranges are computed using dynamic analysis in the first phase, they include more concrete values than the fuzzer witnessed. Values between the lower and upper bound are not necessarily observed by the fuzzer, and are also not necessarily feasible. If one were to consider only values witnessed at runtime, then it would be possible to analyze kernels for individual traces, although this would be quite expensive [10]. However, if we can soundly show that no special values or large roundoff errors (cancellations) occur inside a kernel for a given input range, we have shown this for more executions than can be explored by dynamic testing in general (since there are usually too many floating-point values to explore exhaustively). Unlike NaN or Infinity, which are easy to detect, cancellation cannot, in general, be detected by inspecting the computed results; our combination of dynamic and static analysis is therefore valuable.

#### 3 First Phase: Whole Program Analysis

*Abstract Interpretation with Astrée* We utilize Astrée as it scales for large C programs with complex code and data structures. We add wrapper functions to provide bounds for global variables, since Astrée does not assume ranges for global variables directly. We further annotate the kernels K with Astrée's \_\_ASTREE\_log\_vars() construct. This construct records the range information that Astrée logs about the kernel inputs at the entry of the kernels.

Note that the analysis of Astrée can be extensively parameterized with the knowledge of the program under analysis. Although this makes the analysis even more precise, it requires vast manual effort and knowledge of the intricacies of the program. To avoid this, we parameterize Astrée as generically as possible. We only use semantic loop unrolling until a defined loop bound to reduce the over-approximation in the analysis for all benchmarks.

*Blackbox Fuzzing with Blossom* We implement our novel blackbox fuzzing for kernel range computation in a tool we call Blossom. Blossom works by instrumenting the program to be analyzed. Blossom is implemented as an LLVM pass and works on C, C++, and Rust input programs with complex programming constructs and data types (and would work for any programming language that compiles to LLVM). Blossom takes as input the program P, a configuration file that specifies the ranges of program inputs, the fuzzing technique that we want to execute (standard or guided blackbox), and a timeout. The LLVM pass automatically instruments P by inserting code that performs the indicated fuzzing process until the specified timeout, and records the ranges of kernel inputs.

In order to perform vanilla blackbox fuzzing, the code is instrumented with an input generator that utilizes the srand() function with distinctive seeds to randomly generate values of program inputs from the set of input bounds I. This process is continued until the specified timeout.

*Guided Blackbox Fuzzing with Blossom* Algorithm 1 shows our guided blackbox fuzzing algorithm for generating program inputs to maximize kernel ranges. The algorithm is also implemented via LLVM-pass instrumentation in Blossom.

The inputs to Algorithm 1 are the program P with an identified set of kernels K, a set of n program input ranges (I), and a timeout (T). The algorithm is also parameterized by the number of mutations m and a constant c that determines the neighborhood radii for all program inputs from which mutants (new program inputs) are drawn. The algorithm returns a set of kernel ranges [{Rlo}, {Rhi}] (line 16). The goal is to compute the interval [{Rlo}, {Rhi}] as wide as possible.

#### Algorithm 1 Guided Blackbox Fuzzing

The algorithm keeps an input queue Q, which stores program inputs on which the program is to be executed. If Q is empty, m new random inputs taken from the program input ranges I are added to it (lines 6–7). If Q is not empty, the algorithm first dequeues one valuation of all the program inputs {v1, ··· , vn} from Q (line 9), and executes the program P on these program inputs. During the execution of the program, the kernel monitor checks the kernel inputs and updates the kernel ranges as it is done in vanilla blackbox fuzzing (line 10). If the kernel ranges were updated, i.e. we found an input that led to the kernel input being outside of the currently known range, we generate m − 1 mutants from a program input {v1, ··· , vn} by randomly drawing inputs from its neighborhood v1 ± r1, ··· , vn ± rn and add them to the queue (lines 12–14). (We draw mutants randomly from the neighborhood to reduce the possibility of duplicate program inputs.) The neighborhood, i.e. the maximal distance of a mutant to the original program input, is defined by the neighborhood radii {r1, ··· , rn} (computed once on line 3) that depend on the width of each input range. Effectively, if an input range is large, then we will draw mutants from a larger neighborhood as well. This step enables a search in the neighborhood of the inputs that recently enlarged the ranges of the kernels. Then, we generate one random input for all variables in the whole input range (line 15). This step ensures that we do not get stuck in a local maximum or minimum. The whole process is repeated until timeout T.
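The queue-based search described above can be sketched in executable form. This is our reading of the prose, not the listing of Algorithm 1 and not Blossom's LLVM instrumentation. Several details are assumptions: the radius rule `c * width`, clipping mutants to the input ranges, adding the extra random input only when the ranges were widened, the value `c=0.05`, the hypothetical `kernel_inputs` program, and an iteration count in place of the timeout T.

```python
import math
import random
from collections import deque


def kernel_inputs(a, b):
    """Hypothetical program under test; returns the monitored kernel inputs."""
    return [a * a - b, abs(a) + 1.0]


def guided_blackbox(input_ranges, iterations, m=5, c=0.05, seed=0):
    rng = random.Random(seed)
    n_kernel = 2
    r_lo = [math.inf] * n_kernel
    r_hi = [-math.inf] * n_kernel
    # Assumed radius rule: a fixed fraction c of each input range's width.
    radii = [c * (hi - lo) for lo, hi in input_ranges]

    def random_input():
        return [rng.uniform(lo, hi) for lo, hi in input_ranges]

    def mutant(v):
        # Draw from the neighbourhood v_i ± r_i, clipped to the input range
        # (the clipping is an assumption of this sketch).
        return [min(hi, max(lo, rng.uniform(x - r, x + r)))
                for x, r, (lo, hi) in zip(v, radii, input_ranges)]

    queue = deque()
    for _ in range(iterations):              # stands in for the timeout T
        if not queue:
            queue.extend(random_input() for _ in range(m))
            continue
        v = queue.popleft()
        updated = False
        for i, k in enumerate(kernel_inputs(*v)):   # kernel monitor
            if k < r_lo[i]:
                r_lo[i], updated = k, True
            if k > r_hi[i]:
                r_hi[i], updated = k, True
        if updated:
            queue.extend(mutant(v) for _ in range(m - 1))
            queue.append(random_input())     # helps escape local extrema
    return list(zip(r_lo, r_hi))


print(guided_blackbox([(-2.0, 2.0), (0.0, 1.0)], iterations=20_000))
```

The default `m=5` follows the mutant count that the evaluation in Section 6 reports as performing best.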

#### 4 Second Phase: Static Analysis with Daisy and CBMC

Next, we use the computed kernel ranges R as kernel input specifications (preconditions) and adapt the state-of-the-art roundoff-error analyzer Daisy [30] to verify the absence of cancellation errors and special float values. The translation of kernels and the precondition annotation to Daisy's input language in Scala is currently done manually, but could be automated in the future.

Daisy's core roundoff-error analysis performs a forward dataflow analysis. It computes ranges and worst-case absolute error bounds for each intermediate arithmetic (abstract syntax tree) expression using the interval and affine arithmetic abstract domains. As part of this analysis, it checks for overflows and invalid expressions that could lead to NaN values, as their absence is a prerequisite for a meaningful roundoff-error computation.

We extend Daisy to check at every intermediate expression for a possible cancellation, using the ranges and absolute error bounds that Daisy computes by default. At each binary arithmetic operation, we compare the relative errors of the operands with the relative error of the binary operation result. If the relative error increases by more than a given factor, we report an error. We compute the relative error for an intermediate expression x as the ratio of its worst-case absolute error bound $\Delta_x$ divided by the smallest value that the range of x contains. When the range of x ([x]) contains zero, we divide instead by some small constant c, i.e. we compute $\frac{\Delta_x}{\max(c, \min([x]))}$, to make relative errors always well-defined. While this does not compute a sound bound on the relative error, this is not needed for our purpose, since we are only interested in a relative comparison.
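The comparison can be made concrete with a small worked example. The sketch below is not Daisy's code: the function names, the value of the floor constant `c`, the reading of "smallest value" as smallest magnitude, and the choice to compare against the worst operand are all assumptions. The factor of 10^3 is the threshold used in the evaluation in Section 6.

```python
def smallest_magnitude(lo, hi):
    """Smallest absolute value contained in [lo, hi] (0 if it spans zero)."""
    if lo <= 0.0 <= hi:
        return 0.0
    return min(abs(lo), abs(hi))


def relative_error(abs_err, lo, hi, c=1e-10):
    # Not a sound bound: the floor c only keeps the ratio well-defined.
    return abs_err / max(c, smallest_magnitude(lo, hi))


def is_cancellation(operands, result, factor=1e3):
    """operands and result are (abs_err, lo, hi) triples for one operation."""
    worst_operand = max(relative_error(*op) for op in operands)
    return relative_error(*result) > factor * worst_operand


# x, y in [1, 2] with absolute error 1e-16 each; x - y lies in [-1, 1].
x = (1e-16, 1.0, 2.0)
y = (1e-16, 1.0, 2.0)
difference = (2e-16, -1.0, 1.0)
product = (5e-16, 1.0, 4.0)

print(is_cancellation([x, y], difference))  # True
print(is_cancellation([x, y], product))     # False
```

For the subtraction, the result range contains zero, so the relative error is 2e-16 / 1e-10 = 2e-6, ten orders of magnitude above the operands' 1e-16; the product stays within a small constant factor and is not flagged.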

With this extension, we can prove for each kernel and the specified kernel input ranges that cancellation and special values do not occur (but we cannot prove their presence). When Daisy cannot show this, it issues a warning with the possibly problematic intermediate expression. Spurious warnings for special values can be checked with a tool that performs exact reasoning, e.g. CBMC [52], which reports a counterexample trace; the user can use this trace to confirm whether the warning is genuine and, if so, to debug it.

*Optimizing the Kernels* Daisy furthermore provides a rewriting optimization that finds an ordering of an arithmetic expression for which it can show a smaller (absolute) roundoff error [32]. The rewriting relies on the fact that floating-point arithmetic is neither associative nor distributive, so different evaluation orders commit errors of different magnitudes. Daisy's algorithm uses real-valued identities such as associativity and distributivity to rewrite the expression. Using this optimization, we can thus locally improve the accuracy of the numerical kernels.
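Why evaluation order matters can be checked directly in double precision. The snippet below only illustrates non-associativity; it is not Daisy's rewriting procedure, and the values and the `roundoff` helper are chosen for illustration.

```python
from fractions import Fraction

a, b, c = 1e16, -1e16, 1.0

left_first = (a + b) + c    # a + b is exactly 0, so the result is 1.0
right_first = a + (b + c)   # b + c rounds back to -1e16, so c is lost


def roundoff(computed, *terms):
    """Absolute error of `computed` against the exact real-valued sum."""
    exact = sum(Fraction(t) for t in terms)
    return abs(Fraction(computed) - exact)


print(left_first, right_first)           # 1.0 0.0
print(roundoff(left_first, a, b, c))     # 0
print(roundoff(right_first, a, b, c))    # 1
```

Both expressions are equal over the reals, so a rewriting based on real-valued identities may pick whichever ordering has the smaller provable error.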

#### 5 State of the Art on Real-World Programs

We collected a new set of real-world numerical programs from different application domains, as existing floating-point benchmark sets [29] cover kernels only. We first report on our experiments using existing representative state-of-the-art tools on these benchmarks, before evaluating our approach in Section 6.

*Benchmarks* All our benchmark programs are existing programs collected online from a variety of domains such as scientific computing simulations (nbody, pendulum, lulesh, reactor, molecular), physics algorithms (fbench, arclength), numerical methods (linpack) and machine learning (linearSVC). Table 1 provides an overview of the size and complexity of our benchmarks, as well as the number and arithmetic complexity of the kernels that we chose for verification. We also count the number of trigonometric operations (implemented in library functions) in the kernels, and the 'depth' column shows the number of function calls needed to reach the kernels from program entry.


Table 1. Benchmark statistics

These benchmarks are single-threaded C or C++ floating-point programs with arrays, structures, branching, loops, and function calls (we translated the pendulum benchmark manually from Python to C). We modified the benchmarks by replacing dynamic memory allocation, pointer arithmetic, and I/O operations as appropriate, since these are challenging for most program analyses. We considered two versions of fbench: one with user-defined trigonometric functions (V1) and 380 LOC, and another with their library versions (V2). We specified bounds on the program inputs manually and identified a set of numerical kernels containing a large number of arithmetic operations.

*State of the Art* We first evaluate existing state-of-the-art tools on our benchmark set. For this, we choose CBMC, Astrée and AFLGo as representatives for model checking, abstract interpretation and directed greybox fuzzing, respectively. To the best of our knowledge, AFLGo was not used for floating-point debugging before. These tools check for assertion violations, so we have added assertions to our chosen kernels to check for absence of Infinity and NaN using the standard library functions isinf and isnan.

We do not include a deductive verifier (e.g. [24]) in this comparison, because it requires detailed user annotations of every function. None of the state-of-the-art static roundoff-error analysis tools [43,33,31,30,60,65,72] work on the whole applications in our benchmark set. Available dynamic analyses for finding large roundoff errors [10,21,77,21,78,44] or special values [38,57,9] also work only on smaller programs (often restricted to kernels). Only the dynamic-analysis tool FPDebug [10] has been shown to scale beyond numerical kernels, but unfortunately the code has not been actively maintained over the years.

All experiments are done for 64-bit precision and on a Debian server system with 2.67GHz and 50GB RAM. We have used CBMC version 5.12 with MiniSat 2.2.0 (we have observed in our preliminary experiments that CBMC performs better with MiniSat), Astrée's linux64\_b5162300\_release and AFLGo downloaded on June 9, 2020. We have set a 1-hour time budget for all experiments and unrolled all loops for 50 iterations for both CBMC and Astrée.

With CBMC and Astrée, we are able to prove the absence of special float values in linearSVC and rayCasting, two of the smallest benchmarks. Additionally, Astrée also proves the absence of special values in kernels 1 and 5 in fbenchV1. For all other C benchmarks (Astrée does not work on C++ programs), Astrée generates warnings for the potential existence of special values. With AFLGo, however, we do not find any special values within the time limit.

For the nbody and pendulum benchmarks, we originally had larger program input ranges. For these, AFLGo was able to show the presence of special values in the kernels, suggesting that greybox fuzzing is effective for detecting special values. For the subsequent experiments, we have used tighter program input ranges to avoid special values.

#### 6 Evaluation of our Two-Phase Approach

We next evaluate our two-phase approach. For a fair comparison with the state-of-the-art tools, we designate a 1-hour time limit for the entire analysis, allocating 50 minutes for generating the kernel ranges and 10 minutes for the kernel analysis. We have empirically evaluated the effect of the time limit and observed that increasing the time does not affect the results of our benchmarks, but a smaller time limit led to worse results.

*Computing Kernel Ranges* The main step is the computation of the kernel ranges. We compare the kernel ranges obtained with blackbox fuzzing (BB), guided blackbox fuzzing (GBB) (both implemented in Blossom), AFLGo with our iterative widening (AFLGo), and a combination of BB and AFLGo iterative widening (BB+AFLGo). We have empirically determined that with 5 mutants GBB performs the best for all our benchmarks. For AFLGo, we first fuzz the program for 5 minutes and then run our iterative widening that employs the fuzzer in a refinement loop to widen the so-obtained ranges (see Section 2.1) for the next 45 minutes. For BB+AFLGo, we use Blossom's blackbox fuzzing for 25 minutes to generate the initial ranges. On these ranges, we use our range-widening technique with AFLGo for the next 25 minutes.

To compare the obtained kernel ranges, we first compute the width of each kernel range ($\overline{x} - \underline{x}$) and show in Table 2 the average width over all kernel inputs and over 5 runs with different random seeds. For our dynamic analyses, we want to maximize the kernel ranges to cover as many kernel inputs as possible.
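As a small worked example of this metric, the average width can be computed as below. The numbers are made up for illustration and are not taken from Table 2; `average_width` is a hypothetical helper.

```python
from statistics import mean


def average_width(runs):
    """runs: list of runs; each run is a list of (lo, hi) kernel-input ranges."""
    return mean(hi - lo for run in runs for lo, hi in run)


# Two hypothetical runs over a kernel with two inputs.
runs = [
    [(0.0, 2.0), (1.0, 4.0)],   # widths 2.0 and 3.0
    [(0.0, 3.0), (1.0, 3.0)],   # widths 3.0 and 2.0
]
print(average_width(runs))  # 2.5
```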

We also add the sound over-approximated ranges computed by Astrée, whenever these are available. While Astrée produces a warning *inside* the arclength kernel, it still computes a finite range for the kernel *input*.

In 5 out of the 7 kernels where Astrée finds non-trivial ranges, our fuzzing techniques also compute ranges that are close to Astrée's. They are even equal in the case of rayCasting. In the other 2 cases, Astrée reports big ranges whereas all fuzzing techniques compute smaller ranges with the same width, suggesting a possible large over-approximation of Astrée's ranges (or the inability of fuzzers to discover new kernel inputs within the time limit).

Table 2. Comparison of kernel ranges generated by different techniques and settings

In the other cases, when Astrée finds unbounded ranges or does not work, we observe that for all but 3 kernels, all four fuzzing techniques compute very similar range widths. For 3 kernels, however, GBB finds significantly larger ranges, thus discovering kernel inputs that the other methods are not able to find. We thus conclude that guided blackbox fuzzing appears to be most suitable for computing kernel ranges in our benchmarks, as it can discover apparent outliers.

AFLGo often computes the smallest ranges. Our hypothesis is that because AFLGo aims to maximize the number of paths in the program to reach the target locations in the kernels, it focuses on generating values to find new paths rather than generating values exercising an already found path that may increase the width of the kernel ranges.


Table 3. Variation of computed kernel range widths (from the average width) for our three fuzzing techniques (in %), '-' denotes no variation

*Effect of Randomness* All fuzzing techniques (BB, GBB, AFLGo) rely on randomness. To evaluate how the computed kernel ranges are affected by it, we calculate the variation of the range widths compared to the average range width (per variable) over 5 runs. For 7 kernels, we do not detect any variation at all for any of the methods; Table 3 shows the variations for the remaining kernels.

We observe that all methods have large variations for the benchmarks nbody and linpack, i.e. those for which GBB has found very large ranges. This suggests that there are a few corner-case inputs that lead to large kernel ranges (which only GBB was able to reliably find). Further, we see that AFLGo has a large range variation due to randomness for a few additional benchmarks, whereas BB and GBB have variations that are relatively small.

*Conditional Kernel Verification* We were able to (conditionally) prove the absence of special floating-point values for 16 out of the 24 kernels, and (conditionally) prove the absence of cancellation errors for 14 of those kernels. We show these results in the last column of Table 2: '✓' indicates that Daisy could prove both the absence of special values and cancellation in the kernel for the specified kernel ranges, '(✓)' indicates that only the absence of special values could be verified, and '✗' shows when Daisy reports a special-value warning. For the relatively small benchmarks arclength, linearSVC and rayCasting, our verification of the kernels is sound, i.e. unconditional, as we used ranges computed by Astrée.

When Daisy reports a warning, it is not guaranteed that a kernel can actually compute a special-value result, because of 1) Daisy's over-approximation of the concrete program semantics, and because 2) the range we compute may contain values that are not feasible in the actual program execution. To help developers debug warnings reported by the static analyzer, we use CBMC on those kernels.

CBMC reports counterexamples in all kernels for which Daisy reports warnings. Upon code inspection, however, we identified the counterexamples of nbody and fbench to be spurious for the particular program inputs we consider. In these cases, the true kernel input range was discontinuous, and the counterexamples were reported for the infeasible inputs. In particular, in kernel 2 of nbody, a NaN could be produced if the two bodies that are simulated collide, which would not happen for the initial conditions that we chose. Similarly, the kernels in the ray-tracing algorithm of fbench could produce Infinity, if the ray was chosen in a very particular way. With the program input ranges we have chosen, this was impossible.

For linpack, the arithmetic overflow reported is indeed genuine, since a division by zero can occur before the kernel if the input matrix contains a zero on the diagonal, which leads to undefined behavior and the huge range of the kernel inputs. Similarly, for molecular and reactor, arithmetic overflow can occur for a specific position of molecules and a specific value of the angle between particle's direction and the X-axis, respectively.

We note that given the counterexamples produced by CBMC, we could straightforwardly identify the warnings as spurious or genuine. In future work, one could consider refining the kernel monitoring, such that it would not only track a single range per kernel but could detect discontinuous ranges.

Our extension of Daisy reports cancellation-error warnings for one kernel of linearSVC and one kernel of lulesh. We have used a threshold of 10<sup>3</sup> for reporting cancellation, i.e. if the relative errors of the operands and the result differ by more than three orders of magnitude, we report an error. We inspected the kernel code and confirmed that the cancellation warnings are genuine, i.e. there are indeed inputs that will result in a large roundoff error. The number of cancellations found may seem small. We suspect that this is the case, because our benchmarks were mostly written as reference or example programs (e.g. lulesh was developed to be a representative hydrodynamics simulation code), hence we expect them to be carefully developed and tested.

*Kernel Optimization* We have additionally applied Daisy's rewriting optimization on those kernels for which Daisy does not report possible special values. With this procedure, we could reduce the roundoff errors in 8 of the kernels out of which 6 cases are notable. We could reduce the error by 9.5% for linearSVC, 7.1% and 3.3% for two outputs of kernel 2 in pendulum, by 19.8%, 4.0%, 5.8%, and 5.8% for different kernel outputs of lulesh, and by 33.3% for one output of molecular. From these experimental results, we conclude that the ranges that we inferred in the first phase are actually useful for kernel analysis.

#### 7 Related Work

Abstract interpretation-based techniques are in principle uniquely suitable for verifying the absence of special values and safety in floating-point programs. We have chosen Astrée [63] in this work because it is an industrial-strength tool, and as such, supports a wide range of C programs and is designed for scalability. Apron [50] is a library of numerical abstract domains that are sound w.r.t. floating-point arithmetic, and includes, for instance, the domain of polyhedra [19], which is, however, significantly more expensive than the interval arithmetic domain that we use. ELINA [71] provides performance-optimized implementations of many numerical abstract domains, but its polyhedra domain does not support floating-point arithmetic.

These domains only bound variable values; abstract domains [43,33,31,30] or optimization-based static analyses [60,65,72] for bounding roundoff errors provide nontrivial results only for relatively small kernels. For the second step in our framework, we could have in principle chosen any of these tools; we chose Daisy because we found it easy to modify for our needs, and because it already includes the rewriting optimization.

In the space of deductive verification, besides Frama-C [24], the Boogie intermediate verification language [53] also has support for floating-point arithmetic and discharges the verification conditions using the Z3 SMT solver. Similarly, bounded model checking [52] is limited by the performance of the underlying SAT/SMT solvers. While the floating-point support in today's SMT solvers [17,16] has improved significantly in recent years, it is still limited to relatively few arithmetic expressions.

Many interactive theorem provers have floating-point formalizations [49,15,37]. While these do allow to prove complex functional properties [13,14,46], the proofs are largely manual and require significant expertise.

Blackbox testing has been explored to find large roundoff errors by executing a higher-precision version of the program side-by-side [10,21,77]. Recently, whitebox testing has been used for detecting overflows [38], by phrasing the search as a mathematical optimization problem, and large roundoff errors [21,78], by adapting the notion of condition numbers. KLEE-Float [57], FPGen [44] and Ariadne [9] use symbolic execution for finding bugs in floating-point code, including overflows and large precision loss and cancellation. While KLEE-Float relies on the floating-point SMT decision procedures, Ariadne approximates the path constraints and uses the real-valued theory. FPGen injects specialized inaccuracy checks to find cancellations. Only FPDebug [10] has been shown to scale beyond numerical kernels and, to the best of our knowledge, none of the dynamic techniques have been used to obtain range information.

Once a large roundoff error has been identified, Herbgrind [69] can help to locate its root cause, which may be in a different instruction than where the error becomes significant. Herbgrind is thus complementary to our work and may be used to locate root causes of potential cancellation errors reported by Daisy.

Rewriting floating-point expressions in order to optimize roundoff errors has been explored in the tool Herbie [67] and others [74,76]. These approaches attempt to repair unstable code, checking accuracy using a dynamic analysis. They are alternatives to using Daisy for the second step in our framework. Alternative program optimizations that we have not explored in this work, but that also require range information, include mixed-precision tuning [32,20,68] and general non-semantics preserving approximation [70].

Apart from AFLGo [12], there is a wide range of targeted greybox fuzzers, such as those targeting specified program locations [18], rare branches [54], unexplored branches [55,73], or potential vulnerabilities [39,45,22,56]. In our setting, we require fuzzers like AFLGo to target the specific program locations of kernels.

There is a significant body of work on guiding program analyzers. In particular, test case generation is typically guided by a static analysis toward specific parts of the code (e.g., [27,35,66,41,40,58,62,28,59,23,36,34,75,44]). Our approach is similar to these techniques as it infers input ranges to guide verifiers of numerical kernels toward those kernel executions that are relevant in the context of the containing application.

#### 8 Conclusion

Even though floating-point programs have received a lot of attention recently, the focus has been largely on verifying or debugging arithmetic kernels. Our review of existing techniques and tools has shown that few approaches with specific floating-point support are applicable to whole programs without significant user expertise. We have found, however, that standard greybox fuzzing is effective in detecting overflows and NaNs. Meanwhile, static-analysis techniques to show the absence of special values and cancellation errors remain limited to programs with few bounded loops and numerical kernels, respectively.

Instead of trying to scale up existing roundoff-error analysis tools to whole programs, we *combine* them with more scalable analyses that compute the kernel preconditions needed for the roundoff analyses to work. We showed how relatively small adaptations to well-known techniques of directed blackbox and greybox fuzzing are enough to realize such a framework. Together with modifications to an existing roundoff-error analyzer, we are able to *conditionally verify* the absence of special values and cancellations in a number of numerical kernels in realistic floating-point programs that are out of reach for today's analyses. At the same time, our analysis is precise enough to identify several cases of cancellations. While our approach is not suitable and not intended for certification of safety-critical systems, we believe that it nonetheless provides valuable debugging feedback for many real-world applications.

#### Acknowledgements

This research was partially funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) project 387674182 and project 389792660 as part of TRR 248 (see https://perspicuous-computing.science). We also thank Dr.-Ing. Jörg Herter from AbsInt for the training and assistance with Astrée.

#### References



#### **Symbolic Coloured SCC Decomposition**<sup>*</sup>

Nikola Beneš, Luboš Brim, Samuel Pastva, and David Šafránek

Faculty of Informatics, Masaryk University, Brno, Czech Republic {xbenes3,brim,xpastva,safranek}@fi.muni.cz

**Abstract.** Problems arising in many scientific disciplines are often modelled using edge-coloured directed graphs. These can be enormous in the number of both vertices and colours. Given such a graph, the original problem frequently translates to the detection of the graph's strongly connected components, which is challenging at this scale.

We propose a new, symbolic algorithm that computes all the monochromatic strongly connected components of an edge-coloured graph. In the worst case, the algorithm performs O(p · n · log n) symbolic steps, where p is the number of colours and n the number of vertices. We evaluate the algorithm using an experimental implementation based on Binary Decision Diagrams (BDDs) and large (up to 2<sup>48</sup>) coloured graphs produced by models appearing in systems biology.

**Keywords:** strongly connected components · symbolic algorithm · edge-coloured digraphs · systems biology

#### **1 Introduction**

Processing massive data sets poses a series of interesting computational challenges. A variety of these data sets can be modelled as very large multigraphs, augmented by a specific collection of application-dependent edge attributes. These attributes are often represented as colours and the resulting formalism is called an edge-coloured graph [4, 10]. Geographic information systems, telecommunications traffic, or internet data are prime examples of data that are best represented as such edge-coloured graphs. For instance, in social networking, such graphs are typically used to identify groups of nodes related to each other by some specific criteria (Sports, Health, Technology, Religion, etc.) represented as colours. Our interest in processing huge edge-coloured graphs is primarily motivated by applications taken from systems biology [5, 29] and genetics [25] where we have to deal not only with giant graphs as measured by the number of vertices and edges but also with large sets of colours. The colours in such graphs represent various parameters that influence the dynamics of a biological system [5, 9, 46].

Fundamental graph algorithms such as breadth-first search, spanning tree construction, shortest paths, decomposition into strongly connected components (SCCs), etc., are building blocks of many practical applications. For edge-coloured graphs, the primary research focus so far has been on some of the "classical" coloured graph problems, like the determination of the chromatic index, finding sub-graphs with a specified colour property (the coloured version of the k-linked problem), properly edge-coloured cycles and paths, alternating cycles, rainbow cliques, monochromatic cliques, monochromatic cycles, etc. [1–4, 33, 55].

This work was supported by the Czech Science Foundation grant No. 18-00178S.

© The Author(s) 2021

J. F. Groote and K. G. Larsen (Eds.): TACAS 2021, LNCS 12652, pp. 64–83, 2021. https://doi.org/10.1007/978-3-030-72013-1_4

To the best of our knowledge, there is no prior work on SCC decomposition for edge-coloured graphs, even though this problem has many important applications. For example, in biological systems, strongly connected components represent the attractors of the system. These play an essential role in determining the system's properties, since they may correspond, for example, to the specific phenotypes of a cell [21]. The parameters (e.g. reaction rates) in such systems might be represented as edge colours in the state transition graph. The knowledge of attractors and how their structure depends on parameters is vital for understanding various biological phenomena [24, 38]. Other applications where the investigation of attractors is crucial include predictions of global climate change [52] and predictions of the spread of infectious diseases such as COVID-19 [39].

There is a serious computational problem related to the processing of massive edge-coloured graphs, even the non-coloured ones, that significantly affects the tractability of SCC decomposition. The graphs often cannot be handled with standard (explicit) representations since they are too large to be kept in the main memory. Various approaches have been considered to deal with such giant graphs: distributed-memory structures, structures for representing graphs symbolically, or storing the graphs in external memory. We review these approaches in more detail in the related work section.

In [6, 13] we have initially attacked the SCC decomposition problem for massive edge-coloured graphs by developing a parallel semi-symbolic algorithm for detecting terminal SCCs. The algorithm uses symbolic structures to represent sets of parameters, while the graph itself is represented explicitly. The results have shown that the parallel semi-symbolic algorithm is not sufficient for the practical needs to tackle large graphs representing real-world problems. Those findings have motivated us to propose an entirely symbolic approach.

In this paper, we consider edge-coloured multi-digraphs, i.e., multi-digraphs such that each directed edge has a colour and no two parallel (i.e., joining the same pair of vertices) edges have the same colour. Here, we refer to such graphs simply as coloured graphs. For coloured graphs, we can define several notions of strongly connected components involving colours. We consider the simplest case, where the SCCs are monochromatic, that is all their edges have the same colour [35]. This choice is motivated by the application in systems biology, as mentioned above.

We propose a novel fully symbolic algorithm for detecting all monochromatic components in coloured graphs. In practice, it is significantly faster than a naïve execution of a symbolic SCC decomposition algorithm that scans all colours one by one, in particular on massive coloured graphs. This is because in many applications, the edges are largely shared among individual colours [5] and our algorithm is capable of exploiting this fact. The algorithm conceptually follows the lock-step reachability approach by Bloem et al. [14] for monochromatic digraphs. The key new ingredients behind our algorithm are a careful orchestration of the forward and backward reachability for different colours and a sophisticated selection of a set of pivots.

#### **1.1 Related Work**

The detection of SCCs in (monochromatic) digraphs is a well-known problem solvable in linear time. The best serial (explicit) algorithms are Kosaraju-Sharir [50] and Tarjan [53], both inherently based on depth-first search. However, these algorithms do not scale to large graphs, e.g., those encountered in model checking. Therefore, alternative approaches to SCC decomposition have been proposed (I/O efficient, parallel, symbolic algorithms).

The algorithm of Jiang [32] gives an I/O-efficient alternative based on a combination of depth-first and breadth-first search.

Efficient parallel distributed-memory algorithms avoid the inherently sequential DFS step [45] in several different ways. The Forward-Backward algorithm [26] employs a divide-and-conquer approach relying on picking a pivot state and splitting the graph in three independent (no crossing SCCs) parts. The approach of Orzan [44] uses a different distribution scheme called a colouring transformation employing a set of prioritised colours to split the graph into many parts at once. The recursive OWCTY-Backward-Forward (OBF) approach is proposed in [8]. It recursively splits the graph in a number of independent sub-graphs called OBF slices and applies to each slice the One-Way-Catch-Them-Young (OWCTY) technique. In [51] the authors utilise variants of Forward-Backward and Orzan's algorithms for optimal execution on shared-memory multi-core platforms. Finally, Bloemen et al. [15] utilise the important ability of Tarjan's algorithm to return detected SCCs on-the-fly. In particular, they present an on-the-fly parallel algorithm showing promising speedups for large graphs containing large SCCs. On another end, GPU-accelerated approaches to computing SCCs have been addressed, e.g., in [7, 30, 37, 56].

Computing SCCs of (monochromatic) digraphs symbolically is another way to handle giant graphs and has been thoroughly explored in the literature. As in the case of efficient parallelisation, depth-first search is not feasible in the symbolic framework [28]. In consequence, many DFS-based algorithms cannot be easily revised to work with symbolically represented graphs. An algorithm based on forward and backward reachability performing O(n<sup>2</sup>) symbolic steps was presented by Xie and Beerel in [57]. Bloem et al. present an improved O(n · log n) algorithm in [14]. Finally, an O(n) algorithm was presented by Gentilini et al. in [27, 28]. This bound has been proved to be tight in [20]. In [20], the authors argue that the algorithm from [27] is optimal even when considering more fine-grained complexity criteria, like the diameter of the graph and the diameter of the individual components. Ciardo et al. [59] use the idea of saturation [22] to speed up state exploration when computing each SCC in the Xie-Beerel algorithm, and compute the transitive closure of the transition relation using a novel algorithm based on saturation. Besides these generic algorithms, symbolic SCC decomposition methods have recently also been proposed for specific large graphs, e.g., graphs generated by Boolean networks [42, 58].

#### **2 Problem Definition**

As we have already stated in the introductory section, the SCC decomposition problem for edge-coloured graphs has remained mostly unexplored until now. We thus start this paper by introducing and formalising the notion of coloured SCC decomposition itself and state some of its basic properties.

Before giving exact definitions, it might be instructive to discuss the substance of the coloured SCC decomposition intuitively. There are several ways of capturing the notion of a "coloured connected component". For example, one of them is that of a colour-connectivity first introduced by Saad [47]. It is based on alternating paths in which successive edges differ in colour. However, there is no unique, universally acceptable notion of a coloured component.

In the biological application we have in mind, we want to identify a coloured component as a coloured collection of SCCs, that is, a collection where for every colour there is a set of all relevant monochromatic SCCs. Such a setting leads us to represent SCCs in the form of a relation. To that end, we first introduce such a relation for monochromatic graphs (Section 2.1) and subsequently extend it to edge-coloured graphs (Section 2.2). The relation-based approach also gives us the advantage of allowing a feasible symbolic encoding of the problem.

#### **2.1 Graphs and Strongly Connected Components**

Let us first recall the standard definitions of a directed graph and its strongly connected components:

**Definition 1.** A directed graph is a tuple G = (V,E) where V is a set of graph vertices and E ⊆ V × V is a set of graph edges.

We are going to use the word graph to mean directed graph in the following. We write u → v when (u, v) ∈ E and u →<sup>∗</sup> v when (u, v) ∈ E<sup>∗</sup>, the reflexive and transitive closure of E. We say that v is reachable from u if u →<sup>∗</sup> v. The reachability relation allows us to decompose a graph into strongly connected components, defined as follows:

**Definition 2.** In a graph G = (V,E), a strongly connected component (SCC) is a maximal set W ⊆ V such that for all u, v ∈ W, u →<sup>∗</sup> v and v →<sup>∗</sup> u. For a fixed v ∈ V , we write SCC(G, v) to denote the SCC of G that contains v.

If the graph G is clear from the context, we can simply write SCC(v). A set of vertices S ⊆ V is said to be SCC-closed if every SCC W is either fully contained inside S (W ⊆ S), or in its complement (W ⊆ V \ S). Notice that given a vertex v, the set of all vertices reachable from v, as well as the set of all vertices that can reach v, are both SCC-closed.

A pivotal problem in computer science is to find the SCC decomposition of G. As mentioned above, we represent the decomposition in the form of an equivalence relation Rscc such that the individual SCCs are exactly the equivalence classes of Rscc. The relation-based formulation of the SCC decomposition problem is the following:

**Problem 1 (SCC decomposition)** Given a graph G = (V,E), find the SCC decomposition relation Rscc ⊆ V × V such that (u, v) ∈ Rscc if and only if SCC(u) = SCC(v).

Note that SCC(u) is the section of Rscc at the first attribute, i.e. SCC(u) = {v | (u, v) ∈ Rscc}. We denote such a section in the following way: SCC(u) = Rscc(u, −). Here, u is the specific value of an attribute at which the section is taken, and − is used in place of the attributes that remain unchanged. Such notation naturally extends to relations of arbitrary arity.
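The relation-based view of Problem 1 can be illustrated on explicit sets. The following is a minimal sketch, not a symbolic implementation: the helper names `reachable`, `scc_relation` and `section`, as well as the example graph, are assumptions made for illustration only.

```python
from itertools import product


def reachable(edges, start):
    """Return all v with start ->* v (reflexive and transitive closure)."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for a, b in edges:
            if a == u and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen


def scc_relation(vertices, edges):
    """R_scc: all pairs (u, v) that are mutually reachable."""
    reach = {v: reachable(edges, v) for v in vertices}
    return {(u, v) for u, v in product(vertices, repeat=2)
            if v in reach[u] and u in reach[v]}


def section(relation, u):
    """Section of a binary relation at u: every v with (u, v) in it."""
    return {v for (x, v) in relation if x == u}


# Hypothetical example graph with the two SCCs {1, 2} and {3, 4}.
V = {1, 2, 3, 4}
E = {(1, 2), (2, 1), (2, 3), (3, 4), (4, 3)}
R_scc = scc_relation(V, E)
```

Here `section(R_scc, 1)` recovers SCC(1) as the set of all vertices related to 1, which is how the individual components are read off the relation.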

#### **2.2 Coloured SCC Decomposition Problem**

We now lift the formal framework to the coloured setting. An edge-coloured graph can be seen as a succinct representation of several different graphs, all sharing the same set of vertices. Note that to emphasise the difference from the standard graphs as given in Definition 1, we sometimes call the standard graphs monochromatic.

**Definition 3.** An edge-coloured directed multi-graph (coloured graph for short) is a tuple G = (V, C, E) where V is a set of vertices, C is a set of colours and E ⊆ V × C × V is a coloured edge relation.

We also write u <sup>c</sup>→ v whenever (u, c, v) ∈ E. By fixing a colour c ∈ C and keeping only the c-coloured edges (with the colour attribute removed), we obtain a monochromatic graph G(c) = (V, {(u, v) | (u, c, v) ∈ E}). We call this graph the monochromatisation of G with respect to c. Intuitively, one can view the elements of C as a type of graph parametrisation where the edge structure of the graph changes based on the specific c ∈ C.

The SCC decomposition relation of Problem 1 is extended to the coloured SCC decomposition relation Rscc ⊆ V × C × V by relating every colour c ∈ C with all SCCs of the monochromatisation G(c). In consequence, the SCC decomposition problem is then lifted to the coloured SCC decomposition problem as follows:

**Problem 2 (Coloured SCC decomposition)** Given a coloured graph G = (V, C, E), find the coloured SCC decomposition relation Rscc ⊆ V × C × V satisfying (u, c, v) ∈ Rscc if and only if (u, v) belongs to the (monochromatic) SCC decomposition relation of G(c).

From this definition, we can immediately observe the following properties about the relationship of Rscc with the terms which we have defined before:


From this, it should be immediately clear that Rscc contains all components of the underlying monochromatisations.
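As an illustration of Problem 2, the coloured relation can be computed on a small explicit graph by decomposing each monochromatisation separately. This is the naïve colour-by-colour baseline, not the algorithm proposed in this paper; all function names and the example graph are assumptions of this sketch.

```python
def reachable(edges, start):
    """Return all v with start ->* v in a monochromatic edge set."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for a, b in edges:
            if a == u and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen


def monochromatise(coloured_edges, c):
    """Edge set of G(c): keep c-coloured edges, drop the colour."""
    return {(u, v) for (u, col, v) in coloured_edges if col == c}


def coloured_scc_relation(vertices, colours, coloured_edges):
    """All triples (u, c, v) with u and v mutually reachable in G(c)."""
    relation = set()
    for c in colours:
        edges = monochromatise(coloured_edges, c)
        reach = {v: reachable(edges, v) for v in vertices}
        relation |= {(u, c, v) for u in vertices for v in vertices
                     if v in reach[u] and u in reach[v]}
    return relation


# Hypothetical example: red has SCCs {1, 2} and {3}; blue has {1, 2, 3}.
V = {1, 2, 3}
C = {"red", "blue"}
E = {(1, "red", 2), (2, "red", 1),
     (1, "blue", 2), (2, "blue", 3), (3, "blue", 1)}
R = coloured_scc_relation(V, C, E)
```

The same pair of vertices can be related for one colour and unrelated for another: here 1 and 3 share a component only in the blue monochromatisation.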

#### **3 Algorithm**

Conceptually, our algorithm follows the lock-step reachability approach by Bloem et al. [14] for monochromatic graphs. The lock-step algorithm itself is based on the basic forward-backward decomposition algorithm [57]. In this section, we first briefly introduce these two algorithms in order to better explain the key ideas behind our approach and, in particular, the main difficulties encountered in transferring the concepts of these algorithms to edge-coloured graphs. Although the algorithms were originally presented as producing a set of SCCs, we reformulate them slightly using the equivalent relation-based approach as explained in the previous section. After that, we present the coloured SCC decomposition algorithm. However, before we dive into the algorithmics, let us first briefly discuss the computation model we are using.

#### **3.1 Symbolic Computation Model**

As a complexity measure of our algorithm, we consider the number of symbolic steps, or more specifically, symbolic set and relation operations that the algorithm performs. As is customary, we assume that sets of vertices (V ) and colours (C) can be represented symbolically (for example, using reduced ordered binary decision diagrams [17]) as well as any relations over these sets. In particular, we often talk about coloured vertex sets, by which we mean the subsets of V × C.

Aside from normal set operations (union, intersection, difference, product and element selection), we also require some basic relational operations, all of which we outline in Fig. 1. These extra operations tend to appear in other applications as well (such as symbolic model checking [18]), and are thus typically already available in mature symbolic computation packages.

Finally, there are several derived operators that are partially specific to our application to coloured graphs. However, these can be constructed using standard set and relation operations. The intuitive meaning of the derived operators is as follows: Colours returns all the colours that appear in the given coloured vertex set. Pre and Post compute the pre- and post-image of a (monochromatic or coloured) set of vertices, i.e. the set of predecessors or successors of all the vertices in the given set, respectively. Finally, Join takes a coloured vertex set A and computes the set {(u, c, v) | (u, c) ∈ A, (v, c) ∈ A}.
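To make the meaning of the derived operators concrete, the following sketch implements them on explicit Python sets of pairs (vertex, colour) and triples (u, c, v). A symbolic implementation would realise the same operations on BDDs; the function names and the example data below are assumptions of this sketch.

```python
def colours(A):
    """Colours: all colours appearing in a coloured vertex set."""
    return {c for (_, c) in A}


def post(E, A):
    """Post: coloured successors of A, colour by colour."""
    return {(v, c) for (u, c, v) in E if (u, c) in A}


def pre(E, A):
    """Pre: coloured predecessors of A, colour by colour."""
    return {(u, c) for (u, c, v) in E if (v, c) in A}


def join(A):
    """Join: {(u, c, v) | (u, c) in A and (v, c) in A}."""
    return {(u, c, v) for (u, c) in A for (v, d) in A if c == d}


# Hypothetical coloured edge relation and coloured vertex set.
E = {(1, "r", 2), (2, "r", 3), (1, "b", 3)}
A = {(1, "r"), (1, "b")}
```

With these definitions, `post(E, A)` yields `{(2, "r"), (3, "b")}`: vertex 1 has a different successor in each colour, and the operator keeps the two colours apart in a single step.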

#### **3.2 Forward-backward Algorithm**

To symbolically compute the SCCs of a graph G = (V,E), Xie and Beerel [57] observed that for any vertex v ∈ V, the intersection W = F ∩ B of the forward reachable vertices F = {v′ ∈ V | v →<sup>∗</sup> v′} and the backward reachable vertices B = {v′ ∈ V | v′ →<sup>∗</sup> v} is exactly the strongly connected component of G which contains v.

The algorithm thus picks an arbitrary pivot v ∈ V, and divides the vertices of the graph into four disjoint sets: W, F \ W, B \ W and V \ (F ∪ B). This is illustrated graphically in Fig. 2 (left). The set W is then immediately reported as an SCC of the graph, and added into the component relation: Rscc ← Rscc ∪ (W × W). It is easy to see that every other SCC is fully contained within one of the three remaining sets (they are SCC-closed), and the algorithm thus recursively repeats this process independently in each set.

**Fig. 1.** Summary of symbolic operations that appear in the presented algorithms. The derived operations can be implemented using the standard and relational operations. However, they typically also have slightly more efficient direct implementations.

The correctness of the algorithm follows from the initial observation and the fact that every vertex eventually appears in W (either as a pivot or as a result of F ∩ B). In the worst case, the algorithm performs O(|V|<sup>2</sup>) symbolic steps, since every vertex is picked as a pivot at most once and the computation of F and B requires at most O(|V|) Pre/Post operations.
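The forward-backward scheme described above can be sketched on explicit sets as follows. This is an illustration only: it uses an explicit work list instead of recursion, Python sets instead of a symbolic representation, and the names `forward_backward_sccs` and `closure` are assumptions of this sketch.

```python
def forward_backward_sccs(vertices, edges):
    """SCCs via pivot selection and full forward/backward reachability."""
    succ = {v: set() for v in vertices}
    pred = {v: set() for v in vertices}
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)

    def closure(start, step, universe):
        """Repeat image steps (Post or Pre) inside `universe` to a fixpoint."""
        reached, frontier = {start}, {start}
        while frontier:
            image = set().union(*(step[x] for x in frontier))
            frontier = (image & universe) - reached
            reached |= frontier
        return reached

    sccs, work = [], [set(vertices)]
    while work:
        universe = work.pop()
        if not universe:
            continue
        pivot = next(iter(universe))         # arbitrary pivot
        F = closure(pivot, succ, universe)   # forward reachable set
        B = closure(pivot, pred, universe)   # backward reachable set
        W = F & B                            # the SCC of the pivot
        sccs.append(frozenset(W))
        # The three remaining sets are SCC-closed; process them independently.
        work += [F - W, B - W, universe - (F | B)]
    return set(sccs)


# Hypothetical example with SCCs {1, 2}, {3, 4} and {5}.
sccs = forward_backward_sccs({1, 2, 3, 4, 5},
                             {(1, 2), (2, 1), (2, 3), (3, 4), (4, 3), (5, 1)})
```

Each iteration of `closure` corresponds to one Pre or Post step, which is the unit counted in the O(|V|<sup>2</sup>) bound.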

#### **3.3 Lock-step Algorithm**

To improve the efficiency of the forward-backward algorithm, the lock-step approach [14] uses another important observation: To compute W, it is not necessary to fully compute both F and B; only the smaller (in terms of diameter) of the two sets needs to be entirely known. With this observation, the computation of F and B can be modified in the following way: Instead of computing F and B one after the other, the computation is interleaved in a step-by-step manner (dovetailing). When one of the sets is fully computed, the computation of the second set is stopped. Let us call the computed set converged and denote it by Con, and the unfinished set non-converged and denote it by Non. This situation is illustrated in Fig. 2 (right).

**Fig. 2.** Illustration of the difference between the forward-backward algorithm (left) and the lock-step algorithm (right). On the left, we fully compute both backward (B) and forward (F) reachable sets from the pivot v, identifying W as F ∩ B. On the right, without loss of generality, assume F is fully computed first. It thus becomes converged (Con) and the computation of B (Non) is stopped before it is fully explored.

However, even when Con is fully known, we still need to finish the computation of states in Non that are inside Con to discover the whole component W. This is necessary if there are vertices w in W whose forward distance from v (i.e. the length of the path v →<sup>∗</sup> w) is short while their backward distance (the length of the path w →<sup>∗</sup> v) is long, or vice versa. Such vertices are thus only discovered in one of the two reachability procedures and still need to be discovered by the other one to identify the whole component. However, an important observation is that only the vertices already inside Con need to be considered in this step.

After this, the SCC can be identified and reported just as in the forward-backward algorithm. Finally, the recursion now continues in the sets Con \ W and V \ Con. This is because Non is not fully computed; we cannot guarantee that no SCC overlaps outside of Non (Non is not necessarily SCC-closed).

The algorithm is still correct because every vertex is eventually either picked as a pivot or discovered in some W. Furthermore, the way Con and Non are computed guarantees that W is still a whole SCC. In terms of complexity, the algorithm performs O(|V| · log |V|) symbolic steps in the worst case. To see why this is true, we may observe that every vertex appears in W exactly once, and that the smaller of the two sets Con \ W and V \ Con, let us call it S, is always smaller than |V|/2. The authors then argue that the price of every iteration can be attributed (up to a multiplicative constant) to the vertices in S ∪ W and that every vertex appears in S at most O(log |V|) times.
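The lock-step idea can be sketched on explicit sets as follows. Again this is only an illustration under stated assumptions: Python sets stand in for symbolic sets, a work list replaces recursion, and the names `lockstep_sccs` and `image` are hypothetical.

```python
def lockstep_sccs(vertices, edges):
    """SCCs via interleaved (lock-step) forward and backward reachability."""
    succ = {v: set() for v in vertices}
    pred = {v: set() for v in vertices}
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)

    def image(step, xs):
        return set().union(*(step[x] for x in xs))

    sccs, work = [], [set(vertices)]
    while work:
        universe = work.pop()
        if not universe:
            continue
        pivot = next(iter(universe))
        F, B = {pivot}, {pivot}
        f_front, b_front = {pivot}, {pivot}
        # Lock-step phase: one Post and one Pre step per iteration,
        # stopping as soon as one of the two frontiers is exhausted.
        while f_front and b_front:
            f_front = (image(succ, f_front) & universe) - F
            b_front = (image(pred, b_front) & universe) - B
            F |= f_front
            B |= b_front
        if not f_front:                      # forward search converged
            con, other, step, front = F, B, pred, b_front
        else:                                # backward search converged
            con, other, step, front = B, F, succ, f_front
        # Finish the non-converged search, but only inside Con.
        front = front & con
        while front:
            front = (image(step, front) & con) - other
            other |= front                   # updates F or B in place
        W = F & B
        sccs.append(frozenset(W))
        # Only Con is guaranteed to be SCC-closed, so split along Con.
        work += [con - W, universe - con]
    return set(sccs)


# Hypothetical example with SCCs {1, 2}, {3, 4} and {5}.
sccs = lockstep_sccs({1, 2, 3, 4, 5},
                     {(1, 2), (2, 1), (2, 3), (3, 4), (4, 3), (5, 1)})
```

The work list receives only two sets per pivot, Con \ W and the complement of Con, which mirrors the recursion described above.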

#### **3.4 Coloured Lock-step Algorithm**

When developing an algorithm for coloured graphs, we had to deal with multiple challenges which do not appear for monochromatic graphs and require careful consideration. In the following, we refer to the pseudocode in Algorithm 1.

An important observation is that the structure of components in the graph can change arbitrarily with respect to the graph colours. In consequence, our algorithm cannot simply operate with sets of graph vertices as the monochromatic algorithm would. To that end, we use the notion of coloured vertex sets as introduced in Section 3.1 where the symbolic operations we perform on these sets have been described.

#### **Algorithm 1:** Symbolic Coloured SCC Decomposition

     1  Function ColouredSCC(G = (V, C, E))
     2      Rscc ⊆ (V × C × V) ← ∅;
     3      Decomposition(G, Rscc, V × C);
     4      return Rscc;
     5  Function Decomposition(G = (V, C, E), Rscc ⊆ (V × C × V), 𝒱 ⊆ (V × C))
     6      if 𝒱 = ∅ then return;
     7      ℱ, ℬ, ℱ⃗, ℬ⃗ ⊆ (V × C) ← Pivots(𝒱);
     8      ℱ⃗u, ℬ⃗u ⊆ (V × C) ← ∅;
     9      Flock, Block ⊆ C ← ∅;
    10      while Flock ∪ Block ⊂ Colours(𝒱) do
    11          ℱ⃗ ← (Post(G, ℱ⃗) ∩ 𝒱) \ ℱ;
    12          ℬ⃗ ← (Pre(G, ℬ⃗) ∩ 𝒱) \ ℬ;
    13          Flock ← Flock ∪ (Colours(𝒱) \ Colours(ℱ⃗));
    14          Block ← Block ∪ (Colours(𝒱) \ Colours(ℬ⃗) \ Flock);
    15          ℱ⃗u ← ℱ⃗u ∪ (ℱ ∩ (V × Block));
    16          ℬ⃗u ← ℬ⃗u ∪ (ℬ ∩ (V × Flock));
    17          ℱ⃗ ← ℱ⃗ \ (V × Block);
    18          ℬ⃗ ← ℬ⃗ \ (V × Flock);
    19          ℱ ← ℱ ∪ ℱ⃗;
    20          ℬ ← ℬ ∪ ℬ⃗;
    21      end
    22      Con ⊆ V × C ← (ℱ ∩ (V × Flock)) ∪ (ℬ ∩ (V × Block));
    23      ℱ⃗ ← ℱ⃗u ∩ Con;
    24      ℬ⃗ ← ℬ⃗u ∩ Con;
    25      while ℱ⃗ ≠ ∅ ∨ ℬ⃗ ≠ ∅ do
    26          ℱ⃗ ← (Post(G, ℱ⃗) ∩ Con) \ ℱ;
    27          ℬ⃗ ← (Pre(G, ℬ⃗) ∩ Con) \ ℬ;
    28          ℱ ← ℱ ∪ ℱ⃗;
    29          ℬ ← ℬ ∪ ℬ⃗;
    30      end
    31      𝒲 ⊆ V × C ← ℱ ∩ ℬ;
    32      Rscc ← Rscc ∪ Join(𝒲);
    33      Decomposition(G, Rscc, 𝒱 \ Con);
    34      Decomposition(G, Rscc, Con \ 𝒲);
    35  Function Pivots(𝒱)
    36      𝒫 ⊆ (V × C) ← ∅;
    37      𝒱′ ⊆ (V × C) ← 𝒱;
    38      while 𝒱′ ≠ ∅ do
    39          (v, c) ← Pick(𝒱′);
    40          𝒫 ← 𝒫 ∪ ({v} × σ1(v, 𝒱′));
    41          𝒱′ ← 𝒱′ \ (V × Colours(𝒫));
    42      end
    43      return 𝒫;

Initially, the algorithm starts with all vertices and colours, i.e. the full set V × C. However, as the components are discovered, the intermediate results may contain different vertices appearing only for certain subsets of C. As a result, we often cannot pick a single pivot vertex that would be valid for all considered colours. Instead, we aim to pick a pivot set 𝒫 ⊆ V × C such that for every colour that still appears in the current coloured vertex set 𝒱 ⊆ V × C, the set contains exactly one vertex. Alternatively, one can also view the pivot set as a (partial) function from C to V. This is done in the Pivots function.

The lock-step reachability procedure also cannot operate as in a standard graph. First of all, there can be colours where the forward reachability converges first, as well as colours where this happens for backward reachability. The algorithm thus has to account for both options simultaneously. Second, for each colour, the reachability can converge in a different number of steps. To deal with this problem, we introduce the Flock and Block variables. These store the mutually disjoint sets of colours for which forward and backward reachability already converged. The lock-step procedure terminates when Flock and Block contain all the colours that appear in the current coloured vertex set 𝒱.

Throughout the algorithm, we keep track of several coloured-set variables. The first two, ℱ and ℬ, represent the forward and backward reachable sets, respectively. We then have four variables ℱ⃗, ℱ⃗<sub>u</sub>, ℬ⃗, ℬ⃗<sub>u</sub> to represent the frontiers of these sets, i.e., the set of pairs (v, c) such that the vertex v has not yet been expanded in the corresponding reachability procedure for the colour c. The frontier of ℱ is the set ℱ⃗ ∪ ℱ⃗<sub>u</sub>. The sets ℱ⃗ and ℱ⃗<sub>u</sub> contain disjoint colours: ℱ⃗ involves those colours for which the lock-step reachability procedure has not finished yet, while ℱ⃗<sub>u</sub> represents the unfinished part of the frontier that shall be explored in the second while cycle; similarly for ℬ⃗ and ℬ⃗<sub>u</sub>.

In the first while cycle (lines 10–21), we compute the reachability sets in the lock-step manner. Once a reachability set is completed for some colours (i.e., there are no vertices to expand with those colours), we add the colours to the corresponding Flock or Block variable. Note that we ensure that Flock and Block remain disjoint even if the two reachability procedures converged at the same time for certain colours (see line 14). We use Flock and Block to split the newly computed frontier sets into the parts that are to be explored in the next iteration (ℱ⃗, ℬ⃗) and the parts that are currently left unfinished (ℱ⃗<sub>u</sub>, ℬ⃗<sub>u</sub>).

After the first while cycle, we compute the set Con that is an analogue of the converged set of the original lock-step algorithm (line 22). As already suggested above and unlike the original algorithm, this set cannot be just ℱ or ℬ, but is instead a mixture of both, depending on the convergent colours. To compute this set, we use the Flock and Block variables.

The second while cycle (lines 25–30) then completes the unfinished forward and backward reachability sets, restricted to the inside of the converged set. The intersection of ℱ and ℬ then forms a coloured set 𝒲 with the property that for all c ∈ Colours(𝒱), the section 𝒲(−, c), i.e. the set of vertices paired with c in 𝒲, is a strongly connected component of G(c). We create the corresponding relation using the Join operation, add this relation to the resulting Rscc, and recursively call the whole procedure with 𝒱 \ Con and Con \ 𝒲 as the base coloured sets of vertices.

Let us note that there is possibly another approach. Instead of trying to work with all colours still appearing in the coloured vertex set at once, we could fork a new recursive procedure whenever the colour set splits due to the differences in the graph structure. For example, instead of picking multiple coloured vertices as pivots, one could pick a single vertex with a valid subset of colours and then address the remaining colours in a separate recursive call. While such an approach could be to some extent beneficial in a massively parallel environment where each recursive call can be executed independently on a new CPU, the amount of forking in large systems will soon become unreasonable. More importantly, it defeats the purpose of symbolic representation which aims to minimise the number of symbolic operations.
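For experimentation on small inputs, the coloured lock-step procedure of Algorithm 1 can be sketched on explicit Python sets, with pairs (vertex, colour) standing in for symbolic coloured vertex sets. All function and variable names are assumptions of this sketch. One deliberate deviation is labelled in the code: when updating the forward lock set, colours that are already backward-locked are excluded, an assumption made here so that the two lock sets stay disjoint as the text requires.

```python
def colours(A):
    return {c for (_, c) in A}

def post(E, A):
    return {(v, c) for (u, c, v) in E if (u, c) in A}

def pre(E, A):
    return {(u, c) for (u, c, v) in E if (v, c) in A}

def join(A):
    return {(u, c, v) for (u, c) in A for (v, d) in A if c == d}

def restrict(A, cols):
    """A ∩ (V × cols): keep only pairs whose colour is in cols."""
    return {(v, c) for (v, c) in A if c in cols}

def pivots(U):
    """Pick exactly one pivot vertex for every colour appearing in U."""
    P, rest = set(), set(U)
    while rest:
        v, _ = next(iter(rest))
        P |= {(v, c) for (w, c) in rest if w == v}
        done = colours(P)
        rest = {(w, c) for (w, c) in rest if c not in done}
    return P

def decomposition(E, R, U):
    if not U:
        return
    P = pivots(U)
    F, B, f_front, b_front = set(P), set(P), set(P), set(P)
    f_rest, b_rest = set(), set()     # unfinished frontiers
    f_lock, b_lock = set(), set()     # colours converged forward / backward
    all_cols = colours(U)
    while (f_lock | b_lock) != all_cols:           # lock-step phase
        f_front = (post(E, f_front) & U) - F
        b_front = (pre(E, b_front) & U) - B
        # Assumption of this sketch: also exclude b_lock here.
        f_lock |= (all_cols - colours(f_front)) - b_lock
        b_lock |= (all_cols - colours(b_front)) - f_lock
        f_rest |= restrict(F, b_lock)
        b_rest |= restrict(B, f_lock)
        f_front -= restrict(f_front, b_lock)
        b_front -= restrict(b_front, f_lock)
        F |= f_front
        B |= b_front
    con = restrict(F, f_lock) | restrict(B, b_lock)
    f_front, b_front = f_rest & con, b_rest & con
    while f_front or b_front:                      # finish inside Con
        f_front = (post(E, f_front) & con) - F
        b_front = (pre(E, b_front) & con) - B
        F |= f_front
        B |= b_front
    W = F & B
    R |= join(W)
    decomposition(E, R, U - con)
    decomposition(E, R, con - W)

def coloured_scc(V, C, E):
    R = set()
    decomposition(E, R, {(v, c) for v in V for c in C})
    return R

# Hypothetical example: red has SCCs {1,2},{3,4}; blue has {1,2,3},{4}.
E = {(1, "r", 2), (2, "r", 1), (2, "r", 3), (3, "r", 4), (4, "r", 3),
     (1, "b", 2), (2, "b", 3), (3, "b", 1), (4, "b", 1)}
R = coloured_scc({1, 2, 3, 4}, {"r", "b"}, E)
```

The explicit sets make every step inspectable, but they give up the main benefit of the symbolic setting, where a single Pre or Post operation advances all colours at once.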

#### **3.5 Correctness and Complexity of the Coloured Lock-step Algorithm**

**Theorem 1.** Let G = (V, C, E) be a coloured graph. The coloured lock-step algorithm terminates and computes the coloured SCC decomposition relation Rscc.

Proof. We first show that the set 𝒲 computed on line 31 indeed contains one SCC for every colour c ∈ Colours(𝒱) and that the recursive calls of Decomposition preserve the property that the coloured vertex set 𝒱 is SCC-closed with respect to all colours.

Let us assume that 𝒱 is SCC-closed and let us take an arbitrary c ∈ Colours(𝒱). The function Pivots chooses a set that contains exactly one pair whose colour is c, let us call this pair (v, c). Let us further assume that c is assigned into Flock first (the case with Block is completely symmetric).

Let us now choose an arbitrary vertex w such that v and w are in the same SCC of G(c), i.e. v →<sup>∗</sup> w and w →<sup>∗</sup> v. As the first while cycle finishes, ℱ contains all the pairs of the form (u, c) ∈ 𝒱 where u is reachable from v in G(c). Thus, it also contains (w, c) due to the fact that 𝒱 is SCC-closed. Now, either (w, c) ∈ ℬ, or there exists a vertex x such that w →<sup>∗</sup> x, x →<sup>∗</sup> v in G(c) and (x, c) ∈ ℬ⃗<sub>u</sub>. This means that (w, c) is added to ℬ in the second while cycle. In both cases, both (v, c) and (w, c) are then added to 𝒲. As the vertex choices were arbitrary, this proves that the SCC of v in G(c) is contained in 𝒲. Furthermore, if (y, c) ∈ 𝒲 for an arbitrary y, then v →<sup>∗</sup> y and y →<sup>∗</sup> v in G(c), which means that y is in SCC(G(c), v). This proves that 𝒲 contains exactly one SCC for every colour c ∈ Colours(𝒱).

We now argue that Con is SCC-closed with respect to all colours. This immediately implies that both 𝒱 \ Con and Con \ 𝒲 are SCC-closed. Let us assume that there is a colour c ∈ Colours(𝒱) and two vertices v, w in the same SCC of G(c) such that (v, c) ∈ Con, but (w, c) ∉ Con. Let us assume that c ∈ Flock (as above, the case of Block is completely symmetrical). Then (v, c) ∈ ℱ after the first while cycle finishes. This also means that (w, c) ∈ ℱ as the forward reachability procedure is completed for c and thus (w, c) ∈ Con, a contradiction.

What remains is to show that the algorithm terminates and that every SCC is eventually found. Termination follows from the fact that the size of the set 𝒱 always decreases in recursive calls: both 𝒲 and Con are nonempty, because they contain the initial pivot set as a subset. Clearly, a representative of every SCC of every monochromatisation G(c) is eventually chosen as a pivot. Together with the above reasoning, this implies that the algorithm is correct.

**Theorem 2.** Let |V | be the number of vertices in the coloured graph and let |C| be the number of colours. The coloured lock-step algorithm performs at most O(|C|·|V | · log |V |) symbolic steps.

Proof. Let us first note that all the derived operations defined in Fig. 1 use only a constant number of the basic symbolic operations. As we are considering asymptotic complexity here, we can view all the operations in Fig. 1 as elementary symbolic steps.

We first make the observation that each vertex may be chosen as a part of the pivot set at most |C| times. Clearly, once a vertex v is included in the pivot set with a set of colours C′, then {v} × C′ ⊆ Con (due to the monotonicity of the construction of ℱ and ℬ) and the elements of {v} × C′ do not appear in subsequent recursive calls. This means that the total complexity of the calls to Pivots is bounded by O(|C| · |V|) and we can exclude the calls from the rest of the complexity analysis.

We now consider the complexity of a single call to Decomposition without the subsequent recursive calls. Let us now select one of the colours for which the lock-step reachability procedure (lines 10–21) finished last, i.e., one of the colours that have been added to Flock or Block in the final iteration of the cycle. Let us call this colour c. Recall that σ2(c, X ) is the set of vertices with colour c in a coloured set X .

Let us denote by W := σ2(c, W) and let S be the smaller of σ2(c, V\Con) and σ2(c, Con \ W). Clearly S contains at most |V |/2 vertices. Let k = |S ∪ W|. We now argue that the number of symbolic steps in a given call (without the recursive calls) is bounded by O(k).

Assume w.l.o.g. that c ∈ Flock (a completely symmetric argument solves the case c ∈ Block ). Then σ2(c, Con) = σ2(c, F). If S is σ2(c, Con \ W) then k is the size of σ2(c, F). Each iteration of the first while cycle puts at least one vertex with colour c into F; otherwise c would not be one of the last colours to finish. This means that the cycle runs for at most k iterations. This also means that the size of σ2(x, X ) for all colours x and X ∈ {F, B} is also bounded by k, which in turn means that the second while cycle cannot make more than O(k) steps.

If S is σ2(c, V\Con) instead, let us define B := σ2(c, B) right after the first while cycle has finished. We know that B ⊆ S ∪ W: if a vertex v were in B \ S then (v, c) ∈ Con = F and thus v ∈ W. Again, each iteration of the first while cycle puts at least one vertex with colour c into B; otherwise c would have been in Block before it appeared in Flock . Similarly to the previous case, this means that both while cycles run for at most O(k) steps.

The rest of the argument uses amortised reasoning, in a way similar to the proof in [14]. Note that each vertex is going to be an element of the set W as described above at most |C| times (once for each colour). Furthermore, each vertex is going to be an element of the set S as described above at most |C|·log |V | times: for each colour, the vertex can be an element of the smaller of the two sets at most log |V | times. As the cost of each single call can be charged to the vertices in S ∪ W as explained above, it is sufficient to charge each vertex the total cost of |C| + |C| · log |V |. Together, this means that the total number of symbolic steps is bounded by O(|C|·|V | · log |V |).

Note that the upper bound established by Theorem 2 is no better than the one we would get if we split the coloured graph into its monochromatic constituents and processed each monochromatic graph separately using the original lock-step algorithm [14]. We remark, however, that the coloured approach is a heuristic whose real complexity might be much smaller. Indeed, the complexity analysis in the previous proof focused on a single colour, omitting the fact that SCCs for many other colours are found at the same time. In cases where the edges are largely shared among the colours, which is true in many applications, the heuristic has the potential to significantly outperform the parameter-scan approach. The situation is similar to that of coloured model checking; see the observations made in [5].

### **4 Experimental Evaluation**

In this section, we examine the applicability of our algorithm in real-world situations. First, we discuss how we implemented the algorithm and share some useful recommendations in this regard. We then look at how the implementation performs on real-life coloured graphs which are derived from large models considered in computational biology.

#### **4.1 Implementation**

As our symbolic set representation, we consider standard reduced ordered binary decision diagrams (ROBDDs, or just BDDs for short) [17]. The source of our edge-coloured graphs are the transition systems of parametrised Boolean networks (PBN) as understood in [11, 60].

**Boolean networks.** Normal (non-parametrised) Boolean networks [34, 46, 49, 54] appear in computational systems biology as logical models of complex biochemical processes [16]. Here, we use the asynchronous variant of BNs introduced by Thomas [54]. A Boolean network consists of Boolean variables, each having a Boolean update function. Update functions are executed non-deterministically and change the state of the Boolean variables. The semantics of such a network is a directed graph where the vertices are the possible valuations of the Boolean variables and the edges are induced by the non-deterministic execution of the update functions.
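
As an illustration of the asynchronous semantics described above, the following minimal Python sketch enumerates the edges of the state-transition graph of a small network. The two-variable network and the helper name `async_edges` are hypothetical and chosen only for this example; they are not taken from AEON or from the models used in the paper.

```python
from itertools import product

# Hypothetical two-variable network (a, b), not taken from the paper:
#   a' = not b,  b' = a
UPDATES = [
    lambda s: not s[1],  # update function of variable 0 ("a")
    lambda s: s[0],      # update function of variable 1 ("b")
]


def async_edges(updates):
    """Return the edge set of the asynchronous state-transition graph."""
    n = len(updates)
    edges = set()
    for state in product([False, True], repeat=n):
        for i, f in enumerate(updates):
            new_value = f(state)
            if new_value != state[i]:
                # Asynchronous step: only variable i changes its value.
                target = state[:i] + (new_value,) + state[i + 1:]
                edges.add((state, target))
    return edges


EDGES = async_edges(UPDATES)
```

For this toy network, the four states form a single cycle, so the whole graph is one non-trivial SCC.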

This type of model is especially challenging for symbolic analysis. It is a well-known fact that using symbolic structures, like BDDs, to represent very large state spaces gives good results for synchronous systems, but shows its limits when trying to tackle asynchronicity (see e.g. [23]).

In the parametrised variant, the update functions can be partially unknown. This introduces a set of colours (parametrisations), each colour fully instantiating all update functions of the network. As a result, the semantics of such a model is an edge-coloured directed graph as we consider in this paper. For a full technical description of PBNs and their coloured graph semantics, please refer to [11].
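
To make the coloured semantics more concrete, the sketch below builds an explicit edge-coloured graph for a tiny parametrised network. Everything in it is an assumption made for illustration: the network, the three candidate update functions (which play the role of colours), and the name `coloured_edges`. Real PBNs, as defined in [11], are represented symbolically rather than by enumeration.

```python
from itertools import product

# Hypothetical PBN with variables (a, b): the update b' = a is known, while
# the update function of a is unknown and may be any of the candidates below.
# Each candidate corresponds to one colour (parametrisation).
CANDIDATES = {
    "not_b": lambda s: not s[1],
    "b": lambda s: s[1],
    "true": lambda s: True,
}


def coloured_edges(candidates):
    """Map every edge (source, target) to the set of colours that enable it."""
    edges = {}
    for colour, f_a in candidates.items():
        updates = [f_a, lambda s: s[0]]
        for state in product([False, True], repeat=2):
            for i, f in enumerate(updates):
                value = f(state)
                if value != state[i]:
                    target = state[:i] + (value,) + state[i + 1:]
                    edges.setdefault((state, target), set()).add(colour)
    return edges


COLOURED = coloured_edges(CANDIDATES)
```

In this example, the edges induced by the known update of `b` are shared by all three colours, while the edges induced by `a` depend on the colour.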

Our implementation heavily relies on the existing internal libraries of our tool AEON [12], which at the moment partially supports symbolic analysis of PBNs. Specifically, AEON uses symbolic BDD-based representation of colour sets, but relies on explicit state space exploration. In this work, we extend these capabilities to fully symbolic analysis of the whole graph.

**Custom operations.** Aside from implementing the Post and Pre operations for a given PBN, we also choose to provide specialised implementations for the Colours and Pivots procedures. Especially for the Pivots procedure, this can greatly reduce the number of necessary symbolic steps, as we avoid picking pivots vertex-by-vertex.

To implement these two operations as efficiently as possible, we always order the Boolean variables in our BDDs starting from the colour and ending with vertex variables. This ensures that both Pivots and Colours can be implemented by pruning the vertex variable nodes and minimising the BDD.

Specifically, in this ordering, for Colours, all vertex nodes are effectively substituted with the true terminal node and the BDD is minimised. For Pivots, one (arbitrary) path of vertex variable nodes (corresponding to one pivot vertex) is preserved for every colour, and the rest of the vertex nodes are pruned.
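
The intended result of the two operations can be sketched on explicit sets of (vertex, colour) pairs. This is only an explicit-set analogue under our own assumptions, not the BDD-based implementation described above; the function names `colours` and `pivots` are hypothetical.

```python
def colours(coloured_set):
    """Colours that occur in a set of (vertex, colour) pairs."""
    return {c for _, c in coloured_set}


def pivots(coloured_set):
    """Keep exactly one vertex for every colour present in the set."""
    chosen = {}
    for v, c in sorted(coloured_set):
        # The first vertex seen for colour c is kept as its pivot.
        chosen.setdefault(c, v)
    return {(v, c) for c, v in chosen.items()}


X = {(1, "r"), (2, "r"), (2, "g")}
```

Here `pivots(X)` keeps vertex 1 for colour `"r"` and vertex 2 for colour `"g"`; any other choice of one vertex per colour would be equally valid.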

**Trimming.** Finally, most graphs typically contain a large number of trivial SCCs that introduce unnecessary overhead to the main algorithm. To avoid this overhead, we additionally perform a trimming step before each invocation of Decomposition. Trimming consists of repeatedly removing all vertices which have no outgoing or no incoming edges and is employed by most symbolic SCC algorithms on standard directed graphs as well. The coloured analogue of trimming is straightforward, as it can be achieved using Pre and Post operations just as in the non-coloured case. For a coloured set of vertices V, Post(Pre(G, V) ∩ V) ∩ V returns only vertices which have at least one predecessor in V. The successor variant simply exchanges the Post and Pre operations.
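
A minimal explicit-set sketch of coloured trimming follows. It assumes a hypothetical representation in which `edges` maps each colour to a set of (source, target) pairs and `V` is a set of (vertex, colour) pairs. For brevity, it uses the simpler conditions Post(V) ∩ V and Pre(V) ∩ V instead of the exact expression given above, and it is not the symbolic implementation used in AEON.

```python
def post(edges, X):
    """Pairs (w, c) such that v -> w is an edge of colour c and (v, c) is in X."""
    return {(w, c) for (v, c) in X for (u, w) in edges.get(c, ()) if u == v}


def pre(edges, X):
    """Pairs (u, c) such that u -> w is an edge of colour c and (w, c) is in X."""
    return {(u, c) for (w, c) in X for (u, t) in edges.get(c, ()) if t == w}


def trim(edges, V):
    """Repeatedly drop pairs with no predecessor or no successor inside V."""
    while True:
        kept = post(edges, V) & pre(edges, V) & V
        if kept == V:
            return V
        V = kept


EDGES = {"r": {(1, 2), (2, 1), (2, 3)}, "g": {(1, 2), (2, 3)}}
ALL = {(v, c) for v in (1, 2, 3) for c in ("r", "g")}
```

In this example, colour `"g"` is acyclic, so trimming removes all of its vertices, while the cycle between vertices 1 and 2 survives for colour `"r"`.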

#### **4.2 Experiments**

We evaluated our algorithm on 7 real-world networks based on the models from the Ginsim Boolean network database [19]. The experiments were performed on a 32-core AMD Ryzen workstation with 64 GB of RAM. All tested models are available in our source code repository.<sup>3</sup> Note that the smaller models (< 2<sup>30</sup>) should be easy to process even on a less powerful machine; however, the larger models can require substantial amounts of RAM.

<sup>3</sup> https://github.com/sybila/biodivine-lib-param-bn/tree/tacas

**Table 1.** Overview of the test models for the algorithm evaluation. The times (minutes:seconds) refer to the total runtime of the SCC decomposition procedure. The model variables and parameters give the number of Boolean variables necessary to represent the PBN symbolically. Finally, the graph size and colour set size specify the magnitude of |V|·|C| and |C| for the coloured graph corresponding to the network.

The PBNs and their analysis runtimes are summarised in Table 1. For each network, we specify the number of Boolean variables used by the symbolic encoding, separated into model variables (vertices) and model parameters (colours), and the actual approximate size of the coloured graph. Note that not all combinations of parameters (possible graph colours) are usually biologically admissible, and these are filtered out before the coloured SCC decomposition. Hence the size of the graph is smaller than the space of all the considered BDD variables.

From the presented results, we can draw the following observations: First, the fully symbolic approach allows us to scale to much larger graphs than before, especially in terms of state space. Until now, AEON was typically limited (even for the easier problem of bottom SCC detection) to vertex counts of 2<sup>15</sup>–2<sup>20</sup>, exhausting memory even for much smaller state spaces when dealing with a complex parameter space. Here, we can easily handle up to 2<sup>30</sup> vertices with a non-trivial parameter space and we hope to push this number even higher with further optimisations to our experimental implementation.

Second, the coloured heuristic is beneficial for symbolic computation. To support this claim, we considered a monochromatic variant of the decomposition problem for the WG Signaling Pathway and tested the basic lock-step algorithm on a collection of pseudo-random monochromatisations of this graph. Processing one such monochromatisation typically required 0.5–1 second. Considering the graph in question has 2359296 colours, processing the colours one-by-one would, even in ideal conditions (0.5 seconds per colour, that is, 2359296 · 0.5 s ≈ 328 hours), take well above 300 hours (more than 12 days).

### **5 Conclusions**

In this paper we have presented a fully symbolic algorithm for detecting all monochromatic strongly connected components in edge-coloured graphs. The work has been motivated by systems sciences, namely systems biology, where the need for efficient automated analysis of components in large graphs with large sets of coloured edges is emergent. The algorithm combines several ideas inspired by existing state-of-the-art algorithms for SCC decomposition in a non-trivial way. We believe this is the first fully symbolic algorithm aiming to solve the problem efficiently.

The experimental evaluation has shown that in expected practical scenarios, the presented algorithm has a strong potential to be significantly faster than iterating a standard algorithm for SCC decomposition executed on all monochromatic sub-graphs one-by-one.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Case Studies**

### Local Search with a SAT Oracle for Combinatorial Optimization

Aviad Cohen, Alexander Nadel and Vadim Ryvchin

Intel Corporation, P.O. Box 1659, Haifa 31015, Israel

{aviad.cohen,alexander.nadel}@intel.com,vadimryv@gmail.com

Abstract. NP-hard combinatorial optimization problems are pivotal in science and business. There exists a variety of approaches for solving such problems, but for problems with complex constraints and objective functions, local search algorithms scale the best. Such algorithms usually assume that finding a non-optimal solution with no other requirements is easy. However, what if it is NP-hard? In such a case, a SAT solver can be used for finding the initial solution, but how can one continue solving the optimization problem? We offer a generic methodology, called *Local Search with SAT Oracle (*LSSO*)*, to solve such problems. LSSO facilitates implementation of advanced local search methods, such as variable neighbourhood search, hill climbing and iterated local search, while using a SAT solver as an oracle. We have successfully applied our approach to solve a critical industrial problem of cell placement and productized our solution at Intel.

#### 1 Introduction

Real-life *combinatorial optimization problems* are pivotal in science, operations research, engineering, economics, and business [11, 13, 20, 21, 23].

Loosely speaking, an instance of a combinatorial optimization problem deals with the minimization of an objective function over a finite set, subject to *feasibility constraints* (or, simply, *constraints*). The set of all elements satisfying the constraints is referred to as the set of *feasible solutions* (or, simply, *solutions*). In this paper, we focus on solving any problem, which can be expressed as a constraint optimization program (COP) [2]. Arguably, the vast majority of combinatorial problems, encountered in practice, fall under this category.

Many important combinatorial problems are NP-hard. For such problems, various algorithmic strategies have been devised, including complete methods, such as branch-and-bound and dynamic programming, and incomplete methods, such as greedy algorithms and local search. Each such method imposes requirements on the mathematical properties of the problem with a consequent limit on the scope of applicability.

Local search algorithms stand out from the rest in that they impose relatively mild constraints on the type of the problem to be addressed, thus providing a wide scope of applicability. Furthermore, they seem to scale better with input size relative to complete algorithms [24]. This makes local search algorithms an attractive choice. However, local search algorithms may return a low-quality solution or no solution at all, given a problem for which the mere task of finding a feasible solution is NP-hard. Henceforth, we shall refer to such problems as *NP-Hard-Feasible* problems.

This paper introduces the *Local Search with SAT Oracle (*LSSO*)* methodology, that is, local search algorithms which use a SAT solver (or a SAT-based optimization algorithm; details appear later) as an oracle. A key advantage of our approach is that it can handle problems with complex constraints and objective functions. In particular, it can handle NP-Hard-Feasible problems.

To see how SAT solvers might be useful, consider the basic version of a local search for an optimal solution. At the beginning, the local search generates an initial solution and sets it as the current solution. Then, it enters a loop. In each iteration, it looks for a solution with a lower value of the objective function *within a neighbourhood* of the current one. If such a solution is found, it is set to be the current solution, and the execution resumes. Otherwise, the algorithm terminates and returns the current solution.

A key component of local search algorithms is the *neighbourhood function*, which assigns to each feasible solution a subset of feasible solutions, called its *neighbourhood*. Ordinarily, a neighbourhood of the current feasible solution comprises a set of solutions which can be obtained from the current solution by applying a small collection of *feasibility-preserving* perturbations to its combinatorial structure. A key concern is ensuring that neighbourhoods: (i) are polynomially searchable, and (ii) contain high-quality solutions. However, meeting *both* requirements might be challenging, since polynomial searchability implies that neighbourhoods should be small, and hence less likely to contain high-quality solutions. In addition, in the case of NP-Hard-Feasible problems, it is not clear how to achieve polynomial searchability, since a search should, in particular, be able to find a feasible solution, which is NP-hard.

Our main idea is to let the SAT solver both find an initial solution and conduct the neighbourhood search. The designer can now define feasibility constraints and neighbourhoods declaratively, that is, by a set of SAT constraints. The designer has more freedom to choose neighbourhoods, which need neither be small, nor contain only solutions close to the current solution. This is because the search of the now complex and possibly large neighbourhoods is entrusted to SAT solvers, constructed precisely to efficiently search large complex subspaces. Our approach lends itself to implementations of advanced local search variants, such as variable neighbourhood search, hill climbing and iterated local search [29].

An important feature of our algorithms is that they are *anytime*. Recall that an anytime algorithm is expected to return a valid solution even if interrupted. An anytime algorithm for an optimization problem is expected to find an *improving* set of solutions. The anytime property is essential for industrial application, since it allows the user to get an approximate solution even for very difficult instances [14, 15].

We demonstrate the usefulness of our approach by solving hard industrial instances of the NP-Hard-Feasible cell placement problem. Cell placement is one of the most important problems in VLSI automation [28]. Its most basic version concerns placing without overlap a set of rectangles on a grid, while minimizing the occupied area. In reality, the problem is more complex. Our approach has been successfully productized at Intel.

The rest of this paper is organized as follows: Sect. 2 provides the necessary background. Sect. 3 introduces our LSSO methodology. Sect. 4 shows how to solve placement with LSSO. Sect. 5 presents the experimental results. Sect. 6 concludes our paper.

#### 2 Background

This section provides some background. Sect. 2.1 is an overview of COP. Sect. 2.2 describes the cell placement problem and shows how to reduce it to COP. Sect. 2.3 discusses how one can solve a COP using a SAT-based bit-vector solver. Sect. 2.4 reviews local search.

#### 2.1 Constraint Optimization Program (COP)

This work presents a new methodology for solving a wide class of combinatorial optimization problems, which can be expressed as a Constraint Optimization Program, shown in Def. 1.

Definition 1 (Constraint Optimization Program (COP) [2]). *A constraint optimization program is a tuple* (X , D, C, Ψ) *where:*


#### 2.2 The Cell Placement Problem

*Cell Placement (Placement)* is a major stage in the VLSI design cycle [8,16]. The input of the cell placement problem comprises the following components:


We are interested in *feasible* placements, that is, placements in which no cell overlaps other cells or forbidden regions. Given a feasible placement, we define the *size of a net* n ∈ I as the perimeter of the box bounding its placed cells. We define the *size of the placement* as the sum of the sizes of the nets. We are required to find a feasible placement of a minimal size. An example is shown in Fig. 1.

In industrial practice, there may be additional *industrial requirements*, such as aligning some of the cells, enforcing parity constraints (i.e., the user might require the y coordinates of some of the cells to be either even or odd) [19], and ensuring a minimal distance between some of the cells and others. We omit further details due to IP considerations.

Placement is NP-Hard-Feasible, since the NP-complete bin packing problem can be reduced to the decision version of the placement problem [10].

2.2.1 Constraint Optimization Program for Cell Placement. We show how to construct a COP for the cell placement problem. For each cell c ∈ C, let c<sup>west</sup> and c<sup>east</sup> denote its leftmost and rightmost column respectively, and c<sup>south</sup> and c<sup>north</sup> denote its bottom and top row. Strictly speaking, it suffices to use c<sup>west</sup> and c<sup>south</sup> as the COP's independent variables, but it is convenient to use c<sup>east</sup> and c<sup>north</sup> as syntactic sugar for c<sup>west</sup> + c<sup>width</sup> and c<sup>south</sup> + c<sup>height</sup>, respectively. The COP looks as follows:

(a) Each cell c is placed wholly within the grid region:

$$(c^{west} \ge 0) \land (c^{east} \le N) \land (c^{south} \ge 0) \land (c^{north} \le M)$$

(b) For every pair of cells c<sub>i</sub>, c<sub>j</sub> such that i < j, there is no overlap:

$$(c_i^{west} \ge c_j^{east}) \lor (c_j^{west} \ge c_i^{east}) \lor (c_i^{south} \ge c_j^{north}) \lor (c_j^{south} \ge c_i^{north})$$

(c) For every pair r, c of a forbidden region r and a cell c, there is no overlap:

$$(r^{west} \ge c^{east}) \lor (c^{west} \ge r^{east}) \lor (r^{south} \ge c^{north}) \lor (c^{south} \ge r^{north})$$


$$||n|| = \left(\max_{c \in n} (c^{east}) - \min_{c \in n} (c^{west})\right) + \left(\max_{c \in n} (c^{north}) - \min_{c \in n} (c^{south})\right)$$

$$\Psi = \sum_{n \in \mathbb{I}} ||n||$$

Fig. 1: Placement example [16]. A solution is shown for the problem of placing five cells c<sub>1</sub>, c<sub>2</sub>, c<sub>3</sub>, c<sub>4</sub> and c<sub>5</sub> of sizes 4×1, 4×3, 2×2, 2×4 and 1×5 respectively, on a grid with *M* = *N* = 8. There are three nets: n<sub>1</sub> = {c<sub>1</sub>, c<sub>3</sub>, c<sub>5</sub>}, n<sub>2</sub> = {c<sub>2</sub>, c<sub>3</sub>} and n<sub>3</sub> = {c<sub>2</sub>, c<sub>4</sub>} (without any forbidden regions). The bounding boxes of the nets are B<sub>1</sub>, B<sub>2</sub> and B<sub>3</sub>, respectively. The sizes of the nets, comprising the perimeters of the bounding boxes, are 20, 18 and 20, respectively. The overall placement size is 20 + 18 + 20 = 58. The solution is an optimal one.
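
The constraints (a)–(c) and the objective Ψ can be checked for a concrete placement with a few lines of Python. The sketch below is only illustrative: the rectangle encoding `(west, south, width, height)`, the function names, and the three-cell instance are our own assumptions and do not reproduce the instance of Fig. 1. The net size follows the formula for ||n|| exactly as written above.

```python
from itertools import combinations


def east(r):
    return r[0] + r[2]   # west + width


def north(r):
    return r[1] + r[3]   # south + height


def disjoint(a, b):
    """The no-overlap disjunction used in constraints (b) and (c)."""
    return (a[0] >= east(b) or b[0] >= east(a)
            or a[1] >= north(b) or b[1] >= north(a))


def is_feasible(cells, forbidden, N, M):
    rects = list(cells.values())
    inside = all(r[0] >= 0 and east(r) <= N and r[1] >= 0 and north(r) <= M
                 for r in rects)                                     # (a)
    cells_ok = all(disjoint(a, b) for a, b in combinations(rects, 2))  # (b)
    regions_ok = all(disjoint(r, c) for r in forbidden for c in rects)  # (c)
    return inside and cells_ok and regions_ok


def net_size(net, cells):
    rs = [cells[name] for name in net]
    return ((max(east(r) for r in rs) - min(r[0] for r in rs))
            + (max(north(r) for r in rs) - min(r[1] for r in rs)))


def placement_size(nets, cells):
    return sum(net_size(net, cells) for net in nets)


# Hypothetical instance: three cells on a 4x4 grid, two nets.
CELLS = {"c1": (0, 0, 2, 1), "c2": (2, 0, 1, 2), "c3": (0, 1, 2, 2)}
NETS = [{"c1", "c2"}, {"c1", "c3"}]
```

For this instance, `is_feasible(CELLS, [], 4, 4)` holds and both nets have size 5, so the placement size is 10. Adding a forbidden region that covers any placed cell makes the placement infeasible.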

#### 2.3 Solving COP with SAT

A COP can be solved with various types of solvers [2]. In particular, it is possible to solve a COP by reduction to a series of SAT solver invocations through bit-vector reasoning as explained below.

2.3.1 Bit-vector Solving and SAT. We start with reviewing the basic terminology, related to SAT solving. A *literal* l is a Boolean variable v or its negation ¬v. A clause is a disjunction of literals. A formula F is in *Conjunctive Normal Form (CNF)* if it is a conjunction (set) of clauses.

A SAT solver [4] receives a CNF formula F and returns a satisfying assignment (aka, model or solution), if one exists. In *incremental SAT solving under assumptions* [5, 18], the user may invoke the SAT solver multiple times, each time with a different set of *assumption literals* (called, simply, the *assumptions*) and, possibly, additional clauses. The solver then checks the satisfiability of all the clauses provided so far, while enforcing the values of the current assumptions.

A *bit-vector variable (bit-vector)* of *width* n = |B|, B = {v<sub>n</sub>, v<sub>n−1</sub>, ..., v<sub>1</sub>}, is a sequence of n Boolean variables, called *bits*. Bit v<sub>1</sub> is the Least Significant Bit (LSB) and v<sub>n</sub> is the Most Significant Bit (MSB). A *Boolean constant* is either ⊥ (0) or ⊤ (1). A *bit-vector constant* is a bit-vector (BV), each one of whose bits is substituted by a Boolean constant. A *bit-vector term* is either a bit-vector, a BV constant, or a result of applying an operator which returns a bit-vector (for example, BV addition, if-then-else, concatenation) over other terms and atoms. An *atom* is either a Boolean variable, a Boolean constant or a result of applying an operator, which returns a Boolean (for example, = or unsigned-less-than), over BV terms and atoms. A *bit-vector formula* (also known as a *bit-vector constraint*) is recursively defined to be either an atom, a negation of a bit-vector formula, or the result of applying the Boolean operator ∧ or the Boolean operator ∨ over two or more bit-vector formulas. See [3, 12] for a rigorous description of the BV language. A BV solver decides the satisfiability of BV formulas.

A BV formula F is *satisfiable* iff it has a *model*, that is, an assignment of BV and Boolean constants to their corresponding BV and Boolean variables, which satisfies F. In this paper, BV constants are interpreted as unsigned numbers, and BV comparison operators are interpreted as unsigned. For example, given a bit-vector B = {v<sub>3</sub>, v<sub>2</sub>, v<sub>1</sub>}, the formula F = B < 2 has two models μ<sub>1</sub> : μ<sub>1</sub>(B) = 0 and μ<sub>2</sub> : μ<sub>2</sub>(B) = 1.

All the algorithms presented in this work are assumed to use the so-called *eager* BV solver [6] which, following some preprocessing, translates the input BV formula to an equisatisfiable formula in CNF and solves it with a SAT solver. Thus, we will use the notions of BV solving and SAT solving interchangeably. We also assume the BV solver to have the same incremental API as a SAT solver.

Since the variables in a COP have finite domains, both the variables and the constraints of a COP can be easily expressed as BV variables and BV constraints.

In particular, in the COP constructed for the cell placement problem in Sect. 2.2.1, the variables and the constraints can be expressed as BV variables and constraints as follows: For each cell c, we define four bit-vectors: c<sup>west</sup> and c<sup>east</sup> of width ⌈log N⌉ as well as c<sup>south</sup> and c<sup>north</sup> of width ⌈log M⌉. All the constraints in our COP involve these bit-vectors and can be expressed in terms of operators and relations available in the BV language [3]. Specifically, we implement min and max operators using a series of if-then-else operators. In addition, for every operator, we zero-extend the widths of the operands and the resulting bit-vector to prevent an overflow, whenever required.

Reducing the constraints of a COP to a BV formula and invoking a BV solver suffices to find one non-optimal solution. However, for solving the optimization problem by reduction to BV, one needs an extension of BV solving to optimization.<sup>1</sup>

2.3.2 Extending Bit-vector Solving to Optimization. One can extend bit-vector solving to the so-called Bit-Vector Optimization (OBV) [19] as follows:

A model μ of a BV formula F is *T-minimal*, for a given bit-vector T, iff μ(T) ≤ ν(T) (where the comparison is unsigned) for every model ν of F. Given a BV formula F and a term T = {t<sub>n</sub>, t<sub>n−1</sub>, ..., t<sub>1</sub>} in F, where T is called the *optimization target* (or, simply, the *target*), *Bit-Vector Optimization (OBV)* is the problem of finding a T-minimal model of F. The bits of the target T are referred to as the *target bits*.

Translating our placement COP to OBV is straightforward. We have already shown how to translate the constraints. The optimization target is constructed in the same way as the objective function Ψ is constructed in the COP.

How can one solve OBV in practice? First, one can use the following simple anytime Linear Search algorithm, implemented on top of an incremental BV solver [16,27]:

```
solver.Assert(F); μ := solver.Sat()    ▷ assert F and find the first solution
while μ is a solution do               ▷ while there is still a solution
    solver.Assert(T < μ(T))            ▷ block all the solutions with cost ≥ μ(T)
    μ := solver.Sat()                  ▷ can we improve our solution?
return μ                               ▷ μ is guaranteed to be T-minimal
```
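
The Linear Search loop can be sketched in Python against a toy oracle. The class `ToySolver` is a hypothetical brute-force stand-in for an incremental BV solver (it simply enumerates all values of a single bit-vector of a given width); its methods `assert_` and `sat` are our own names and do not correspond to any real solver API. Unlike the pseudocode, the sketch keeps the last model explicitly in `best`, so that the returned value is always a solution.

```python
class ToySolver:
    """Hypothetical incremental oracle: brute force over 0 .. 2**width - 1."""

    def __init__(self, width):
        self.domain = range(2 ** width)
        self.constraints = []

    def assert_(self, predicate):
        self.constraints.append(predicate)

    def sat(self):
        """Return some model (an integer), or None if unsatisfiable."""
        # Scan from large values so that the first model is far from optimal.
        for x in reversed(self.domain):
            if all(p(x) for p in self.constraints):
                return x
        return None


def linear_search(solver, formula, target):
    """Return a target-minimal model of formula, or None if there is none."""
    solver.assert_(formula)
    best = None
    model = solver.sat()
    while model is not None:
        best = model                      # anytime: best is always a solution
        bound = target(best)
        # Block all the solutions with cost >= target(best).
        solver.assert_(lambda x, bound=bound: target(x) < bound)
        model = solver.sat()
    return best


RESULT = linear_search(ToySolver(4), lambda x: x % 3 == 1, lambda x: x)
```

With the formula `x % 3 == 1` over 4-bit values, the oracle returns 13, 10, 7, 4 and 1 in turn, and the search stops once no model below 1 exists.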

Another anytime algorithm to solve OBV is the following binary search-based algorithm, called OBV-BS [9, 19]:

```
solver.Assert(F); μ := solver.Sat()    ▷ assert F and find the first solution
i := n                                 ▷ i is the current bit number, initialized to the MSB
while i ≥ 1 and μ(t_i) = ⊥ do          ▷ fix to ⊥ the MSBs, assigned to ⊥ in μ
    solver.Assert(¬t_i)
    i := i − 1                         ▷ after the loop, i will point to the first target bit, assigned ⊤
while i ≥ 1 do                         ▷ check one-by-one, if we can flip the remaining target bits to ⊥
    μ := solver.Sat({¬t_i})            ▷ run the solver under the assumption ¬t_i
    if satisfiable then
        while (i ≥ 1 and μ(t_i) = ⊥) do solver.Assert(¬t_i); i := i − 1 endwhile
    else
        solver.Assert(t_i); i := i − 1 ▷ t_i cannot be flipped to ⊥, so we fix it to ⊤
return μ
```
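
The same kind of toy oracle can be used to sketch OBV-BS in Python. As before, `ToySolver` is a hypothetical brute-force stand-in for an incremental BV solver with assumptions, and the helper names `bit`, `is_zero` and `is_one` are our own. The sketch keeps the last satisfying model in `best` instead of overwriting it when a call under assumptions fails.

```python
class ToySolver:
    """Hypothetical incremental oracle: brute force over 0 .. 2**width - 1."""

    def __init__(self, width):
        self.domain = range(2 ** width)
        self.constraints = []

    def assert_(self, predicate):
        self.constraints.append(predicate)

    def sat(self, assumptions=()):
        """Return a model satisfying all clauses and assumptions, or None."""
        for x in reversed(self.domain):
            if all(p(x) for p in self.constraints) and all(a(x) for a in assumptions):
                return x
        return None


def bit(value, i):
    """Bit i of value, where bit 1 is the LSB."""
    return (value >> (i - 1)) & 1


def obv_bs(solver, formula, target, n):
    """Binary-search style minimisation of an n-bit target."""
    def is_zero(i):
        return lambda x: bit(target(x), i) == 0

    def is_one(i):
        return lambda x: bit(target(x), i) == 1

    solver.assert_(formula)
    best = solver.sat()
    if best is None:
        return None
    i = n
    # Fix to 0 the most significant target bits already 0 in the first model.
    while i >= 1 and bit(target(best), i) == 0:
        solver.assert_(is_zero(i))
        i -= 1
    while i >= 1:
        model = solver.sat([is_zero(i)])      # try to flip target bit i to 0
        if model is not None:
            best = model
            while i >= 1 and bit(target(best), i) == 0:
                solver.assert_(is_zero(i))
                i -= 1
        else:
            solver.assert_(is_one(i))         # bit i cannot be 0: fix it to 1
            i -= 1
    return best


RESULT = obv_bs(ToySolver(4), lambda x: x % 3 == 1 and x >= 5, lambda x: x, 4)
```

For the formula `x % 3 == 1 and x >= 5` over 4-bit values, the first model is 13, the first flip of the most significant bit yields 7, and no further target bit can be set to 0.
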
We have successfully applied OBV-BS for solving the problem of *fixing* an existing placement [19], closely related to the generic placement problem we are exploring in this work. However, both Linear Search and OBV-BS failed to scale to industrial instances of our current problem of finding an optimal placement from scratch (with Linear Search scaling somewhat better than OBV-BS).

<sup>1</sup> One cannot use MaxSAT [26], the widely used extension of SAT to optimizing a linear Pseudo-Boolean (PB) function, to solve COP in the generic case, since the objective function is not guaranteed to be linear PB. In particular, it is *not* linear PB for placement, if only because the variables are bit-vectors, rather than Booleans.

Recently, we have introduced the so-called Polosat anytime algorithm [16], which can be used *instead* of the standard SAT solver inside Linear Search (and other SAT-based anytime optimization algorithms) to make it substantially more scalable. The idea behind Polosat, shown below, is to simulate local search using a SAT solver. We use the strictly-monotone version of Polosat [16], which assumes the availability of the so-called Boolean *observable variables (observables) Obs*, that is, a set of Boolean variables on which the objective function depends (for placement, the observables might comprise the bits of the bit-vectors, representing the sizes of the nets, for every net). Polosat is carried out by getting a model μ and then trying to improve it by repeatedly flipping observables, which have not been assigned ⊥ in previous models:

```
function SOLVER.POLOSAT(assumptions)
Require: Target bit-vector T is available; Observables Obs are available.
    μ := solver.Sat(assumptions)                      ▷ get the first model μ
    is_good_epoch := 1                                ▷ good epoch: an iteration, which improves μ
    while is_good_epoch do                            ▷ one loop is an epoch
        B := {v : v ∈ Obs, μ(v) = ⊤}                  ▷ remove any observables, assigned ⊥
        is_good_epoch := 0
        while B is not empty do
            b_i := B.front(); B.dequeue()
            σ := solver.Sat(assumptions ∪ {¬b_i})     ▷ trying to flip b_i
            if satisfiable then
                if σ(T) < μ(T) then μ := σ and is_good_epoch := 1
                B := {b : b ∈ B, σ(b) = ⊤}            ▷ remove any observables, assigned ⊥
    return μ
```
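
A minimal Python sketch of the Polosat loop is given below, again on top of a hypothetical brute-force oracle. Several simplifications are assumptions of ours: a model is a single integer, the observables are given as bit positions of that integer, and the polarity fixing strategies mentioned in the footnote are omitted. None of the names correspond to a real solver API.

```python
class ToySolver:
    """Hypothetical incremental oracle: brute force over 0 .. 2**width - 1."""

    def __init__(self, width):
        self.domain = range(2 ** width)
        self.constraints = []

    def assert_(self, predicate):
        self.constraints.append(predicate)

    def sat(self, assumptions=()):
        for x in reversed(self.domain):
            if all(p(x) for p in self.constraints) and all(a(x) for a in assumptions):
                return x
        return None


def bit(value, i):
    """Bit i of value, where bit 1 is the LSB."""
    return (value >> (i - 1)) & 1


def polosat(solver, target, observables, assumptions=()):
    """Improve a model by trying to flip observables that are still 1."""
    best = solver.sat(assumptions)
    if best is None:
        return None
    good_epoch = True
    while good_epoch:                          # one iteration is an epoch
        queue = [i for i in observables if bit(best, i) == 1]
        good_epoch = False
        while queue:
            b = queue.pop(0)
            flip = lambda x, b=b: bit(x, b) == 0
            sigma = solver.sat(list(assumptions) + [flip])
            if sigma is not None:
                if target(sigma) < target(best):
                    best = sigma
                    good_epoch = True
                # Observables already 0 in sigma need not be tried again.
                queue = [i for i in queue if bit(sigma, i) == 1]
    return best


SOLVER = ToySolver(4)
SOLVER.assert_(lambda x: x % 3 == 1 and x >= 5)
RESULT = polosat(SOLVER, lambda x: x, [4, 3, 2, 1])
```

On this toy instance the first model is 13, and flipping the most significant observable yields 7; the later flips find only the worse models 10 and 13, so the second epoch brings no improvement and the loop stops. In general, Polosat alone is not guaranteed to reach a T-minimal model, which is why it is used inside Linear Search.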

To combine Polosat into Linear Search, it is sufficient to replace solver.Sat invocations by solver.Polosat invocations in the code.<sup>2</sup> We have shown in [16] that replacing plain SAT invocations by Polosat invocations in Linear Search makes our placement tool substantially more scalable. We reaffirm this result in Sect. 5.

Yet, despite the significant progress we had witnessed when applying Polosat, we found that combining Polosat into Linear Search is still insufficient for solving a variety of complex real-world instances of our industrial placement problem. This empirical challenge led us to develop our LSSO methodology, presented in this paper. As we shall see, combining LSSO and Polosat makes our tool considerably more scalable, while the methodology itself is generic and can be applied to solving a wide range of optimization problems.

#### 2.4 Local Search Algorithms

Local search strategies [1] are a collection of *algorithmic templates*. An algorithmic template specifies the main flow of an algorithm, but leaves some details unimplemented. By implementing these details for a specific problem, one obtains an algorithmic solution for that problem.

<sup>2</sup> Polosat also uses polarity fixing strategies, such as TORC [14,17], omitted here; please refer to [16] for details. Additional non-anytime OBV algorithms are introduced in [19, 22].

2.4.1 Basic Local Search Strategy. The basic strategy generates an initial feasible solution and sets it as the *current solution*. Then, it enters a loop. In each iteration, it looks within a *neighbourhood* of the current solution for a feasible solution with a lower value of the objective function. If one is found, it becomes the current solution. Otherwise, the algorithm terminates and returns the current solution. Note that this version is guaranteed to stop; it does so when it reaches a *local minimum* of the objective function with respect to the neighbourhood used.

To turn this algorithmic template into a complete algorithm, one has to implement the following *problem-dependent* items: (i) A procedure for generating an initial feasible element. (ii) A *neighbourhood function* assigning to each solution a subset of solutions. (iii) An algorithm for searching the neighbourhood for a better solution.
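The template and its three problem-dependent items can be sketched in Python as follows. This is a minimal illustration, not code from the paper: the function name `basic_local_search` and the toy instance (minimising a quadratic over a small integer range with ±1 neighbours) are assumptions chosen only to make the loop concrete.

```python
from typing import Callable, Iterable, List, TypeVar

S = TypeVar("S")


def basic_local_search(
    initial: S,
    neighbourhood: Callable[[S], Iterable[S]],
    cost: Callable[[S], float],
) -> S:
    """Return a local minimum of `cost` with respect to `neighbourhood`."""
    current = initial  # item (i): the initial feasible solution
    while True:
        # Items (ii) and (iii): scan the neighbourhood for a strictly better solution.
        better = next(
            (s for s in neighbourhood(current) if cost(s) < cost(current)), None
        )
        if better is None:
            return current  # no improving neighbour: local minimum reached
        current = better


# Hypothetical toy instance: minimise (x - 7)^2 over the integers 0..20,
# where the neighbours of x are x - 1 and x + 1 (kept inside the range).
def toy_neighbours(x: int) -> List[int]:
    return [y for y in (x - 1, x + 1) if 0 <= y <= 20]


def toy_cost(x: int) -> int:
    return (x - 7) ** 2


result = basic_local_search(0, toy_neighbours, toy_cost)
```

Because every accepted move strictly lowers the cost, the loop cannot revisit a solution, which is why termination is guaranteed on a finite feasible set.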

2.4.2 Neighbourhood Functions. A key factor, which affects both the complexity of the search and the quality of the resulting solution, is the selection of a *neighbourhood function*. In theory, the selection ought to depend on a mathematical analysis of the structure of the feasible set and the profile of the objective function. For complex problems, such an analysis is usually beyond reach. The classical approach to neighbourhood definition is based on the following problem-independent general principles:


However, as we have argued in Sect. 1, this approach is not without issues. In particular, feasibility-preserving perturbations may not be easy to find, especially for NP-Hard-Feasible problems, while having small neighbourhoods implies a low likelihood of high-quality solutions.

2.4.3 Advanced Versions of Local Search. A disadvantage of the basic version of local search is that it may stop at a local minimum of poor quality if too small a region of the feasible space is explored. To circumvent this outcome, advanced variants enabling an exploration of larger portions of the feasible space have been devised [7, 29]. Those described here provide some mechanism to escape from the local minimum to "nearby" solutions and resume the search from there. They have been designed to accommodate situations where local minima are not distributed uniformly in the feasible space, but are rather clustered in close proximity [25].

The *variable neighbourhood search* approach uses multiple neighbourhoods to escape from local minima. It relies on the fact that a local minimum with respect to one neighbourhood need not be a local minimum with respect to another (if the latter is not contained in the former). The algorithm maintains a set of neighbourhood functions. Once a local minimum with respect to the current neighbourhood is reached, the neighbourhood is switched, and the search is resumed.
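A small sketch of this idea, under stated assumptions: the function `variable_neighbourhood_search`, the cost table `COSTS`, and the two step-based neighbourhoods are hypothetical, and returning to the first neighbourhood after every improvement is one possible switching policy rather than the only one.

```python
from typing import Callable, List, Sequence

COSTS = [9, 7, 5, 6, 8, 3, 4, 6, 1, 2]  # hypothetical objective values per solution


def step(size: int) -> Callable[[int], List[int]]:
    """Neighbourhood function: solutions at distance `size`, kept in range."""
    return lambda x: [y for y in (x - size, x + size) if 0 <= y < len(COSTS)]


def variable_neighbourhood_search(
    initial: int,
    neighbourhoods: Sequence[Callable[[int], List[int]]],
    cost: Callable[[int], int],
) -> int:
    current, k = initial, 0
    while k < len(neighbourhoods):
        better = next(
            (s for s in neighbourhoods[k](current) if cost(s) < cost(current)), None
        )
        if better is not None:
            current, k = better, 0  # improvement: resume with the first neighbourhood
        else:
            k += 1  # local minimum w.r.t. neighbourhood k: switch to the next one
    return current  # local minimum w.r.t. every neighbourhood in the list


def cost(x: int) -> int:
    return COSTS[x]


stuck = variable_neighbourhood_search(0, [step(1)], cost)
escaped = variable_neighbourhood_search(0, [step(1), step(3)], cost)
```

With only the ±1 neighbourhood the search stops at solution 2 (cost 5); adding the ±3 neighbourhood lets it leave that local minimum and reach solution 8 (cost 1).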

The *hill climbing* method allows the selection of a non-improving solution, once a local minimum is reached. Since the objective function no longer monotonically decreases, there is now a possibility of a cycle: a solution may be visited more than once forcing the search into an infinite loop. One can deal with this problem in various ways: ignore it and let the algorithm run until the timeout expires, use randomization, or introduce data structures that keep track of the search history and prohibit solutions that have already been encountered. The latter approach is referred to as *tabu search*.

Another idea is to use *large neighbourhoods*. This approach increases the size of the explored region and the likelihood of better solutions. However, large neighbourhood search may become intractable.

The *iterated local search* approach can be viewed as "a local search within a local search". In each iteration of the search, it uses a *subsidiary search algorithm* to explore iteratively a feasible sub-space. Once a local minimum is returned, a new search is initiated in a region, whose elements are obtained by "perturbing" the recent solution.

All the above approaches can be implemented within our LSSO framework. The key difference between LSSO and previous approaches is using SAT or Polosat as an oracle for both finding the initial solution and carrying out the neighbourhood search.

#### 3 Local Search with SAT Oracle (LSSO)

This section introduces the main contribution of our paper. We propose using SAT as an oracle in local search algorithms to address the scalability and quality issues that arise in the classical local search algorithms, especially, given an NP-Hard-Feasible problem.

Given a combinatorial optimization problem, the first stage in designing an LSSO solution is expressing the problem as a COP.

In the second stage, the COP decision variables are translated to bit-vectors, and the feasibility constraints are translated to a BV formula (including any additional industrial requirements). One might experiment with several alternative formulations and select the one deemed best.

The third stage is defining the so-called neighbourhood generators. A *neighbourhood generator* N(μ) accepts as input a solution μ (that is, a model to the bit-vector formula representing the COP), and generates *neighbourhood constraints*. The set of all the assignments which satisfy the feasibility and neighbourhood constraints constitutes the neighbourhood of the solution. Thus, finding such an assignment amounts to finding an element of the neighbourhood of μ.

A key ingredient of our methodology is the adoption of a neighbourhood concept, which differs significantly from the classical one, described in Sect. 2.4.2:

1. The neighbourhood need not be small and need not contain (only) elements "close" to the current solution.


Note that, in our approach, neighbourhoods direct the search to "higher-quality" regions with respect to the current solution, regardless of the algorithmic difficulties of searching such regions. This is another key aspect of our approach: we trust SAT solvers to search complex sub-spaces efficiently.

Having discussed neighbourhoods, we are now ready to describe the simplest LSSO implementation:

	- (a) The algorithm obtains an initial solution by asserting the feasibility constraints and asking the solver for a model. This model is set as the *current solution* μ.
	- (b) The algorithm enters a loop, in which the solver operates in incremental mode. In each iteration, the algorithm calls the neighbourhood generator with the current solution as input, to generate a list of BV constraints. These are provided to the solver, which is asked for a model. If a model α is found, μ is set to α. Otherwise, the algorithm terminates returning μ.
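Steps (a) and (b) can be sketched as follows. This is an illustration under loud assumptions: `toy_oracle` is a brute-force stand-in for a SAT or bit-vector solver (it is not incremental and does not model Polosat), and the variables, feasibility constraints, and cost function are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Sequence

Assignment = Dict[str, int]
Constraint = Callable[[Assignment], bool]


def toy_oracle(
    variables: Sequence[str], width: int, constraints: List[Constraint]
) -> Optional[Assignment]:
    """Stand-in for the solver: return the first model found, or None if none exists."""
    for values in product(range(2 ** width), repeat=len(variables)):
        candidate = dict(zip(variables, values))
        if all(c(candidate) for c in constraints):
            return candidate
    return None


def simplest_lsso(
    variables: Sequence[str],
    width: int,
    feasibility: List[Constraint],
    neighbourhood_generator: Callable[[Assignment], List[Constraint]],
) -> Optional[Assignment]:
    # (a) Initial solution: a model of the feasibility constraints alone.
    mu = toy_oracle(variables, width, feasibility)
    if mu is None:
        return None  # the problem is infeasible
    # (b) Ask the oracle for a model inside the neighbourhood of mu until none exists.
    while True:
        alpha = toy_oracle(variables, width, feasibility + neighbourhood_generator(mu))
        if alpha is None:
            return mu
        mu = alpha


# Hypothetical instance: two 3-bit variables, minimise x + 2*y.
def cost(a: Assignment) -> int:
    return a["x"] + 2 * a["y"]


def strictly_better(mu: Assignment) -> List[Constraint]:
    bound = cost(mu)
    return [lambda a: cost(a) < bound]  # neighbourhood: any cheaper assignment


FEASIBILITY: List[Constraint] = [
    lambda a: a["x"] + a["y"] >= 5,
    lambda a: a["x"] != a["y"],
]

best = simplest_lsso(["x", "y"], 3, FEASIBILITY, strictly_better)
```

Note that the neighbourhood here is defined purely by constraints over the whole feasible space, so it need not be small or "close" to μ; the oracle carries the burden of searching it.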

The neighbourhood constraints can be given to the solver as either *assumptions* or *assertions*. This leads to two types of search, providing a tradeoff between execution time and quality:


Alg. 1 depicts our implementation of LSSO. The algorithm receives four inputs. The Boolean inputs VNS, HC, and SPEC specify whether variable neighbourhood search, hill climbing, and speculative search are to be used. All combinations are possible, except that *hill climbing requires speculative search*. The input Nmax applies to variable neighbourhood search. It specifies an upper bound on the number of consecutive neighbourhood switches without finding a solution. If that bound is exceeded, the algorithm terminates with the current solution. To effect variable neighbourhood search, the algorithm uses a predefined list of neighbourhood generators N = [N0(μ), N1(μ)... ]. The first generator N0(μ) is considered the default and is used most of the time. The others are used to escape local minima.

Alg. 1 carries out iterated local search with Polosat as an oracle, where the observables are recommended to be set to the bits of the inputs of the objective function. One can also replace the Polosat invocation by an ordinary SAT invocation.

#### 4 LSSO Algorithms for the Cell Placement Problem

This section presents our LSSO-based placement algorithms. All the algorithms are instantiations of Alg. 1 with different sets of parameters. The BV constraints are generated by translating the COP constraints, as discussed in Sect. 2.3. Each algorithm uses some of the neighbourhood generators defined in Sect. 4.1.

The algorithms are presented in Sect. 4.2. None of the algorithms defines the target bit-vector explicitly, since they rely on local search instead of OBV solving. By default, the algorithms use Polosat as the oracle, where the observables comprise all the bits of the bit-vectors representing the sizes of the nets. The size of net n is given by the following bit-vector term (the width of every intermediate term and of the resulting term ||n|| is set to the minimal width which prevents an overflow, and the operands are zero-extended whenever required):

$$\|n\| = \left(\max_{c \in n} (c^{east}) - \min_{c \in n} (c^{west})\right) + \left(\max_{c \in n} (c^{north}) - \min_{c \in n} (c^{south})\right)$$
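As a worked instance of this formula, the sketch below computes the size of a net from the coordinates of its cells. The `Cell` type and the example coordinates are hypothetical, and plain Python integers are used, so the bit-width and zero-extension concerns of the bit-vector encoding do not arise here.

```python
from typing import Iterable, NamedTuple


class Cell(NamedTuple):
    west: int
    east: int
    south: int
    north: int


def net_size(cells: Iterable[Cell]) -> int:
    """Horizontal extent plus vertical extent of the cells in a net."""
    cells = list(cells)
    width = max(c.east for c in cells) - min(c.west for c in cells)
    height = max(c.north for c in cells) - min(c.south for c in cells)
    return width + height


# Hypothetical net with two cells.
net = [Cell(west=0, east=2, south=0, north=1), Cell(west=5, east=7, south=3, north=4)]
size = net_size(net)  # (7 - 0) + (4 - 0)
```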

#### 4.1 Neighbourhood Generators

4.1.1 Neighbourhood Generator *N***1**. Let μ be a placement, that is, a model to the bit-vector formula representing the feasibility constraints. The neighbourhood N1(μ) is designed for a highly localized fast search at the possible expense of quality. To this end, the constraints corresponding to N1(μ) force a decrease of the objective function in a very constrained manner, so as to help the solver return quickly. N1(μ) consists of all legal placements for which all the nets are no bigger and at least one net is smaller than under μ, thus ensuring a lower cost. The constraints are:

$$\overbrace{\left(\bigwedge_{n \in \mathbb{I}} \|n\| \le \mu(\|n\|)\right)}^{\text{each net is no bigger}} \;\wedge\; \overbrace{\left(\bigvee_{n \in \mathbb{I}} \|n\| < \mu(\|n\|)\right)}^{\text{at least one net is smaller}}$$


▷ … neighbourhood switches without a model has not exceeded the bound, move to next neighbourhood

24: if VNS ∧ (i < (Nmax − 1)) then

25: i ← i + 1

26: continue



4.1.2 *N***2**: a Family of Neighbourhood Generators. The N2 family is designed for *variable neighbourhood search*. Each of its neighbourhoods strictly contains N1 and allows the objective function to decrease in more ways. This implies higher-quality solutions at the expense of slower convergence. To define the N2 family, let α = |𝕀| be the number of nets and assume α ≥ 3. For each permutation σ of [1 ... α] and each integer d with 2 ≤ d < α, we define a neighbourhood function N2[σ, d](μ) as follows. Let n<sub>σ(1)</sub>, ..., n<sub>σ(α)</sub> be the permuted sequence of the nets. Partition this sequence into ⌈α/d⌉ segments of size d (the last segment may be shorter). The neighbourhood N2[σ, d](μ) consists of all legal placements for which the sum of the net sizes of each segment is no bigger than under μ, and the sum of at least one segment is smaller. Note that this ensures a cost lower than that of μ. By choosing different pairs σ, d, one may obtain different neighbourhoods. The constraints are:

$$\overbrace{\bigwedge_{k=1}^{\lceil \alpha/d \rceil} \left(\sum_{i=(k-1)d+1}^{\min(kd,\alpha)} \|n_{\sigma(i)}\| \le \sum_{i=(k-1)d+1}^{\min(kd,\alpha)} \mu(\|n_{\sigma(i)}\|) \right)}^{\text{each sum is no bigger}} \;\wedge\; \overbrace{\bigvee_{k=1}^{\lceil \alpha/d \rceil} \left(\sum_{i=(k-1)d+1}^{\min(kd,\alpha)} \|n_{\sigma(i)}\| < \sum_{i=(k-1)d+1}^{\min(kd,\alpha)} \mu(\|n_{\sigma(i)}\|) \right)}^{\text{at least one sum is smaller}}$$
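The difference between the N1 and N2 constraints can be illustrated on concrete net sizes. The sketch below only evaluates the neighbourhood constraints on given sizes; it does not check placement legality, and the function names, net names, and numbers are hypothetical. For simplicity, `sigma` is passed as the already permuted sequence of net names.

```python
import math
from typing import Dict, List, Sequence


def in_n1(sizes: Dict[str, int], mu_sizes: Dict[str, int]) -> bool:
    """N1 constraints: every net is no bigger and at least one net is smaller."""
    no_bigger = all(sizes[n] <= mu_sizes[n] for n in mu_sizes)
    one_smaller = any(sizes[n] < mu_sizes[n] for n in mu_sizes)
    return no_bigger and one_smaller


def segments(sigma: Sequence[str], d: int) -> List[List[str]]:
    """Split the permuted net sequence into ceil(alpha/d) segments of size d."""
    return [list(sigma[k:k + d]) for k in range(0, len(sigma), d)]


def in_n2(
    sizes: Dict[str, int], mu_sizes: Dict[str, int], sigma: Sequence[str], d: int
) -> bool:
    """N2[sigma, d] constraints: per-segment sums no bigger, at least one smaller."""
    sums = [
        (sum(sizes[n] for n in seg), sum(mu_sizes[n] for n in seg))
        for seg in segments(sigma, d)
    ]
    return all(s <= m for s, m in sums) and any(s < m for s, m in sums)


# Hypothetical example with alpha = 5 nets and d = 2.
SIGMA = ["n1", "n2", "n3", "n4", "n5"]
MU_SIZES = {"n1": 4, "n2": 6, "n3": 3, "n4": 3, "n5": 5}
CANDIDATE = {"n1": 5, "n2": 4, "n3": 3, "n4": 3, "n5": 5}

outside_n1 = not in_n1(CANDIDATE, MU_SIZES)       # n1 grew from 4 to 5
inside_n2 = in_n2(CANDIDATE, MU_SIZES, SIGMA, 2)  # but n1 + n2 shrank from 10 to 9
```

The candidate lets one net grow while its segment as a whole shrinks, so it is rejected by N1 yet accepted by N2, and its total cost is still lower than under μ.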

4.1.3 Hill-climbing Neighbourhood Generator *N***3**. N<sup>3</sup> is designed to implement *hill climbing*. We reason as follows: If the current placement is not a global minimum, there exists a placement with at least one smaller net. Hence, to *tunnel away* from the local minimum, we generate the following neighbourhood constraints:

$$\overbrace{\bigvee_{n \in \mathbb{I}} \|n\| < \mu(\|n\|)}^{\text{at least one net is smaller}}$$

#### 4.2 LSSO-based Algorithms for Placement

All the algorithms below are instantiations of Alg. 1; they use lists of neighbourhood generators, composed of the ones defined in Sect. 4.1, where hill climbing is carried out by using the neighbourhood generator N3. Due to project deadline constraints, we did not explore other combinations.

1. single nbr nonspec:
	- (a) parameters: VNS = ⊥, HC = ⊥, SPEC = ⊥, Nmax = 1.
	- (b) list of neighbourhood generators: [N1]
2. many nbr nonspec:
	- (a) parameters: VNS = ⊤, HC = ⊥, SPEC = ⊥, Nmax = 10.
	- (b) list of neighbourhood generators: N2[σ, d](μ), enumerated by drawing σ and d by a pseudo-random generator.
3. many env spec:
	- (a) parameters: VNS = ⊤, HC = ⊥, SPEC = ⊤, Nmax = 10.
	- (b) list of neighbourhood generators: the first generator is N1 and the rest are N2[σ, d](μ), enumerated by drawing σ and d by a pseudo-random generator.
4. many env spec hill clmb:
	- (a) parameters: VNS = ⊥, HC = ⊤, SPEC = ⊤, Nmax = 1.
	- (b) list of neighbourhood generators: [N1]
	- (c) neighbourhood generator N3 is used for hill climbing.

### 5 Experimental Results

We study the performance of the following algorithms within our placement tool:

	- (a) ls (Linear Search, described in Sect. 2.3.2, with Polosat as the oracle)
	- (b) single nbr nonspec (see Sect. 4.2)
	- (c) many nbr nonspec (see Sect. 4.2)
	- (d) many env spec (see Sect. 4.2)
	- (e) many env spec hill clmb (see Sect. 4.2)
	- (a) bs no polosat [19]: OBV-BS (see Sect. 2.3.2).
	- (b) ls no polosat: Linear Search with SAT as the oracle
	- (c) many env spec hill clmb no polosat: many env spec hill clmb with SAT instead of Polosat (to study the impact of disabling Polosat on LSSO, we chose many env spec hill clmb, since, as we shall soon see, it outperforms the other LSSO algorithms in a pairwise comparison).

We used an extensive set of 1200 proprietary industrial designs of various sizes and complexities. The sizes of the grids (where a *grid size* is the width *N* multiplied by the height *M*) can be characterized as follows: a) Minimum size = 70; b) Maximum = 364000; c) Average ≈ 4643; d) Standard deviation ≈ 18829. We used machines with 32 GB of memory running Intel Xeon processors with a 3 GHz CPU frequency.

We ran the algorithms for 600 seconds and measured the quality of the placement at different time intervals. Fig. 2 shows our main results. For each algorithm and time interval, Fig. 2 displays a score which represents the quality. The score is a real number between 0 and 1 inclusive, where the closer the score is to 1 the better. For each algorithm and time interval, the score is computed as follows: we compute the average value of the following score-per-instance: (the result of virtual-best in 600 sec.) / (the result of the current algorithm within the current time interval). Our conclusions:

First, when using SAT as the oracle, Linear Search (ls no polosat) outperforms OBV-BS (bs no polosat), demonstrating that OBV-BS is not useful when the optimization target is a complex arithmetic expression (rather than a vector of lexicographically ordered bits, where each bit is a result of a separate calculation as in [19]). Based on this result, we preferred Linear Search over OBV-BS as the baseline algorithm.

Second, confirming the conclusion of [16], Polosat makes Linear Search substantially more efficient (compare ls to ls no polosat).

Third, and more importantly in the context of this work, our best novel LSSO algorithm, even without Polosat (many env spec hill clmb no polosat), is almost as efficient as Linear Search with Polosat (ls), the latter being the state of the art in solving placement [16]. Moreover, the best Polosat-based LSSO algorithm (many env spec hill clmb) is significantly more efficient than both aforementioned algorithms. This result justifies the usage of both major components of our solution: LSSO, the high-level local search on top of a satisfiability oracle presented in this paper, and Polosat [16], the low-level local search simulation with SAT.

Finally, the virtual best algorithm yields the best overall result, providing evidence that the development of different LSSO algorithms pays off.
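The score used in Fig. 2 (the average, over instances, of the virtual-best result at 600 sec. divided by the algorithm's result at the given time) can be sketched as follows. The data are made up for illustration, and the sketch assumes that every algorithm has a positive result on every instance; how instances without a result are scored is not specified here.

```python
from typing import Dict

Results = Dict[str, float]  # instance name -> placement cost (lower is better)


def virtual_best(final_results: Dict[str, Results]) -> Results:
    """Per-instance minimum over all algorithms (their results at the full timeout)."""
    instances = next(iter(final_results.values())).keys()
    return {i: min(r[i] for r in final_results.values()) for i in instances}


def score(results: Results, best: Results) -> float:
    """Average over instances of (virtual-best result) / (algorithm's result)."""
    ratios = [best[i] / results[i] for i in best]
    return sum(ratios) / len(ratios)


# Hypothetical costs of two algorithms on two instances at the full timeout.
FINAL = {
    "ls": {"i1": 10.0, "i2": 20.0},
    "lsso": {"i1": 8.0, "i2": 40.0},
}

vb = virtual_best(FINAL)
score_ls = score(FINAL["ls"], vb)      # (8/10 + 20/20) / 2
score_lsso = score(FINAL["lsso"], vb)  # (8/8 + 20/40) / 2
```

Since the virtual best is never worse than any single algorithm, each ratio lies in (0, 1], and an algorithm scores 1 only if it matches the virtual best on every instance.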

Additionally, Table 1 shows a pairwise comparison between our four Polosat-based LSSO algorithms. many env spec hill clmb outperforms the others.

Table 2 offers a fine-grained comparison between our best novel LSSO algorithm many env spec hill clmb and the Polosat-based Linear Search ls, the latter being the state of the art in solving placement [16]. The comparison is provided per grid size category and for two different timeouts. LSSO improves the performance significantly for every input size category for both timeouts. Comparing the results for the two timeouts on the biggest instances shows that, for large grids, increasing the timeout widens the gap between LSSO and ls.

Finally, Table 3 shows the unique contribution of each algorithm to the virtual best in 600 sec. (we dismissed all the instances on which there was more than one best-performing solver). Notably, each of the LSSO algorithms is a contributor. Surprisingly, many nbr nonspec contributes more than many env spec hill clmb, despite the latter algorithm outperforming the former in a pairwise comparison. A possible explanation is that we ran many nbr nonspec with Polosat only, while many env spec hill clmb was run twice, once with Polosat and once with SAT. Another surprising result is the significant contribution of many env spec hill clmb no polosat, second only to many nbr nonspec, implying that a SAT-based LSSO algorithm should be part of any parallel portfolio.


Table 1: Pairwise comparison between LSSO algorithms for the timeout of 600 sec. Each nonempty cell (r, c) contains a comparison between Algorithm R in row r and Algorithm C in column c. The value (w d l) in each non-empty cell is interpreted as follows: R outscored C on w instances; there was a draw on d instances; C outscored R on l instances.

Fig. 2: Comparing Algorithms Over Time


Table 2: Comparing the best Polosat-based LSSO algorithm (many env spec hill clmb) to the Polosat-based Linear Search (ls), the latter comprising the previous state-of-the-art.

### 6 Conclusion

We have presented a new methodology for solving NP-hard combinatorial optimization problems, called Local Search with SAT Oracle (LSSO). Our approach can handle problems for which finding even one feasible solution is already NP-hard. LSSO applies local search which uses a SAT solver or the SAT-based optimization algorithm Polosat as an oracle. We have introduced a generic algorithm which integrates different local search schemes within the LSSO framework. Furthermore, we have implemented our approach in an industrial tool for solving the cell placement problem in VLSI and have shown that our new LSSO approach makes the tool substantially more efficient. Our tool has been successfully productized at Intel.


Table 3: Unique contribution to the virtual best per algorithm (sorted by the contribution).

#### References



### **Analyzing Infrastructure as Code to Prevent Intra-update Sniping Vulnerabilities**

Julien Lepiller<sup>1</sup>, Ruzica Piskac<sup>1</sup> (✉), Martin Schäf<sup>2</sup>, and Mark Santolucito<sup>3</sup>

<sup>1</sup> Yale University, New Haven, USA {julien.lepiller,ruzica.piskac}@yale.edu <sup>2</sup> Amazon Web Services, NYC, USA schaef@amazon.com

<sup>3</sup> Barnard College, Columbia University, NYC, USA msantolu@barnard.edu

**Abstract.** Infrastructure as Code is a new approach to computing infrastructure management that allows users to leverage tools such as version control, automatic deployments, and program analysis for infrastructure configurations. This approach allows for faster and more homogeneous configuration of a complete infrastructure. Infrastructure as Code languages, such as CloudFormation or Terraform, use a declarative model so that users only need to describe the desired state of the infrastructure. However, in practice, these languages are not processed atomically. During an upgrade, the infrastructure goes through a series of intermediate states. We identify a security vulnerability that occurs during an upgrade even when the initial and final states of the infrastructure are secure, and we show that such vulnerabilities are possible in Amazon's AWS and Google Cloud. We call them intra-update sniping vulnerabilities. In order to mitigate this shortcoming, we present a technique that detects such vulnerabilities and pinpoints the root causes of insecure deployment migrations. We implement this technique in a tool, Häyhä, that uses dataflow graph analysis. We evaluate our tool on a set of open-source CloudFormation templates and find that it is scalable and could be used as part of a deployment workflow.

#### **1 Introduction**

Managing an infrastructure of thousands of hosts, with different software and servers is nearly impossible to do manually. A relatively new approach to infrastructure management is called Infrastructure as Code (IaC). This has given rise to many different tools with a shared goal: helping system administrators manage their infrastructure in the same way as they manage code. Some tools, like Ansible [20], Puppet [23] or Chef [6] are Configuration Management tools: they allow the administrator to specify the entire configuration of one or more running machines and automatically deploy it by connecting to that machine and performing administrative tasks on behalf of the administrator. These tools automatically detect and apply the steps necessary to switch from the current state of a machine to the desired state, specified by the administrator. Similarly, tools like Amazon's CloudFormation [3] or Hashicorp's Terraform [11] read a description of the desired infrastructure and automatically take the necessary steps to deploy that infrastructure. In CloudFormation, an infrastructure configuration is declared as a set of resources.

© The Author(s) 2021

J. F. Groote and K. G. Larsen (Eds.): TACAS 2021, LNCS 12652, pp. 105–123, 2021. https://doi.org/10.1007/978-3-030-72013-1_6

Fig. 1: A deployment of a computation (the orange lambda), accessing a database (the blue disk stack), which is accessible to the outside world through an API (the purple gateway). The upgrade should change the computation to access more sensitive data (the lambda with the subscript 2), but be authenticated through a user check (the red identification checks).

The benefits of IaC are well known among practitioners: the entire infrastructure is described accurately by a configuration file, making it easy to debug or visualize the infrastructure. This way the infrastructure can be version controlled and documented like any other code. The tools help guarantee identical configuration of hosts, making IaC an essential practice for security and maintainability.

However, for all the benefits IaC brings, it also opens new security vulnerabilities. We have identified a new class of vulnerabilities that appear while the tool is operating on the infrastructure. In order to decrease infrastructure upgrade times, deployment tools typically run many operations in parallel. We argue that this parallelism, as well as the global naming used in these infrastructures, can cause discrepancies during the upgrade that lead to a violation of the intended security policy, even if the initial infrastructure and the target infrastructure are both perfectly secure. We empirically validate our claims by reenacting this vulnerability in both Amazon's AWS and Google Cloud.

#### **1.1 Proof of Concept**

When upgrading the infrastructure, if operators do not provide enough dependencies, i.e., they do not impose an ordering on upgrade operations, a security policy and a protected service might be upgraded in an order that exposes private data. Consider the example given in Figure 1: an API service that replies to any request with some benign information, as depicted in Fig. 1a. The service is upgraded so that the API returns private information about users, and the security policy is modified to allow only authenticated users to access the service, as shown in Figure 1d. This architecture is a core architectural building block for serverless computing. This same configuration is recommended in AWS's "Well Architected" developer guideline series [1]. The upgrade code is functionally correct and implements the desired change, but the user did not specify ordering constraints. Without such constraints, there are two possible upgrade plans. First, as shown in Figure 1b, the backend computation may be updated first. In this case, since the authentication has not yet been added to the API, there is a short period of time where private data is publicly accessible. The amount of time this information is exposed depends on the cloud service provider and the particulars of the infrastructure, but typically ranges from seconds to minutes. We call this kind of vulnerability an intra-update sniping vulnerability. The second possible upgrade order, shown in Figure 1c, implements the desired secure update order. Enforcing the second ordering requires the user to explicitly specify an ordering constraint stating that the authentication must be added before the backend computation is updated.
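The two upgrade plans of this example can be enumerated mechanically. The following sketch is a deliberately simplified model, not the analysis performed by the tool: the state fields, operation names, and the `must_precede` parameter are hypothetical, and each operation is treated as atomic and sequential, whereas real deployments may interleave more finely.

```python
from itertools import permutations
from typing import Callable, Dict, List, Sequence, Tuple

State = Dict[str, object]


def exposed(state: State) -> bool:
    """Private data is publicly reachable: private backend without authentication."""
    return state["backend"] == "private" and not state["auth"]


# Hypothetical upgrade operations corresponding to the two changes in the example.
OPERATIONS: Dict[str, Callable[[State], State]] = {
    "update_backend": lambda s: {**s, "backend": "private"},
    "add_auth": lambda s: {**s, "auth": True},
}

INITIAL: State = {"backend": "public", "auth": False}


def insecure_orders(
    initial: State,
    operations: Dict[str, Callable[[State], State]],
    must_precede: Sequence[Tuple[str, str]] = (),
) -> List[Tuple[str, ...]]:
    """Return every allowed operation order that passes through an exposed state."""
    bad = []
    for order in permutations(operations):
        # Skip orders that violate an ordering constraint (a must run before b).
        if any(order.index(a) > order.index(b) for a, b in must_precede):
            continue
        state = dict(initial)
        for op in order:
            state = operations[op](state)
            if exposed(state):
                bad.append(order)
                break
    return bad


unconstrained = insecure_orders(INITIAL, OPERATIONS)
constrained = insecure_orders(
    INITIAL, OPERATIONS, must_precede=[("add_auth", "update_backend")]
)
```

Without constraints, the order that updates the backend first is flagged; once the constraint that authentication precedes the backend update is added, no allowed order passes through an exposed state.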

Another instance of an intra-update sniping vulnerability happens when components are added to or removed from an infrastructure, but no ordering constraints are given between them and the components that use them. As an example, suppose a user is adding a lambda that reads data from a new S3 bucket. If no dependency is specified, the lambda could be created and connected to the bucket before CloudFormation recognizes that the bucket is already owned. The attacker who owns this bucket may then inject their data into the user's system during the time it takes CloudFormation to notice the naming conflict and roll back the migration. This is related to the issue of S3 bucket namesquatting [15].

Although this paper is mostly focused on Amazon's infrastructure, we have successfully reproduced a similar scenario in Google Cloud, demonstrating that intra-update sniping vulnerabilities are not limited to one cloud provider. We reported this issue to Google, and although they acknowledged the problem, they explicitly stated that it is the responsibility of the user to ensure the security of their deployment.

#### **1.2 Detecting Intra-update Sniping Vulnerabilities**

We propose a tool, Häyhä, that detects possible intra-update sniping vulnerabilities and proposes solutions to users. Häyhä allows CloudFormation users to check the security of planned updates to their infrastructure, before they actually deploy the update. Although our tool is specifically engineered to work with CloudFormation, this class of vulnerabilities is not limited to it, and the proposed solution is generic enough to be adopted in any other Infrastructure as Code language.

The main challenge in detecting intra-update sniping vulnerabilities is determining the underlying issue in common deployment models that leads to the security vulnerability. We identify parallelism and in-place upgrades as the root causes, arguing that there is a trade-off in Infrastructure as Code between security and scalability. On the opposite side of this trade-off, some practitioners advocate for Immutable Infrastructure [12] management, which rebuilds the entire infrastructure from scratch on each update and only switches atomically to the new infrastructure when it is ready. This practice would guarantee atomicity of updates to the infrastructure and the absence of intra-update sniping vulnerabilities. However, it comes with a huge cost in terms of scalability and does not apply well when statefulness is required (for example, when migrating an existing database), making it a less attractive practice.

Naturally, there is a connection between intra-update sniping vulnerabilities and the problem of data races and concurrent access. Our proposed solution of adding ordering constraints is somewhat similar to generic tools in the concurrency domain, such as memory barriers or locks [19,16,24], that add constraints to the order of execution of a program. However, the focus of our work is configuration files that describe infrastructure, not programs. We cannot simply apply existing work, because these configuration files do not have a formal semantics, which creates an additional challenge for our problem domain.

In summary, we identify the following key contributions of this paper:


#### **2 A Model for Infrastructure as Code**

Our tool, Häyhä, detects the possibility of a sniping attack in future deployments. It analyzes the given deployment and raises alarms when it detects potential security issues. The tool follows steps that we further detail in this section.

**Step 1: Internal representation.** First, Häyhä reads the configuration of the current and target infrastructure and translates them to the internal representation. This representation is a dataflow graph identifying which component of the infrastructure has access to which other components, and under which security assumptions. Figure 2 shows two such simplified dataflow graphs that our tool built from the architecture in Fig. 1. From this graph, Häyhä learns the desired security level of each component. In this section we describe how to compute security levels of resources in a given CloudFormation file: in Section 2.1 we describe the concrete syntax of a general CloudFormation file and how it applies to other IaC tools; in Section 2.2 we describe how we model an infrastructure in terms of network communication and security; finally, in Section 2.3 we show the execution semantics and computation of the security level of resources in an infrastructure.

Fig. 2: Dataflow graphs derived from an infrastructure

**Step 2: Capturing all potential upgrade states.** After the initial and target configurations are converted to our model, Häyhä builds an upgrade state, designed to represent every possible intermediate infrastructure that could exist during the upgrade. In Section 2.4 we formally define the upgrade semantics from an initial state to a target state in terms of our model, while in Section 3.1 we show how the upgrade state is built in practice. Figure 3 shows such a state, in the form of a graph, which contains a path (Web to PublicGet to PrivateLambda) allowing any user on the web to access a sensitive resource in a non-secure manner. Finally, in Section 3.2 we discuss how dependency relations refine the upgrade state.

Fig. 3: Upgrade State with a Path Exposing a Security Vulnerability

**Step 3: Analysis.** (Section 3.3) Häyhä computes an over-approximation of the intermediate states and the security level of their nodes in order to answer two questions: 1) is every node in every possible intermediate state at least as secure as the corresponding node in the initial or target configuration? and 2) does every node in every possible intermediate state communicate only with existing nodes? Any possible violation is reported to the user so they can take action and modify their target configuration accordingly. For example, using the DependsOn keyword, one can enforce build orders in a CloudFormation file. For Figure 3, Häyhä reports the possible insecure access to PrivateLambda:

Resource PrivateLambda is not sufficiently protected, it needs at least Authorizer and is protected by None during upgrade. Add DependsOn properties to ensure correct security.

#### **2.1 CloudFormation Infrastructures**

CloudFormation uses a declarative language in which users can specify the desired state of their system. An example of a CloudFormation file is given on the left side of Figure 4. It shows a simplified example of an infrastructure in which an API can be called to access the result of running a Lambda (a simple function). There are no formal semantics for CloudFormation files [4,9] – they are simply YAML or JSON files created from the given AWS CloudFormation templates. Other tools, such as Terraform by HashiCorp, follow a similar template-based design.

To formalize the behavior of IaC languages, we would also need to formalize the precise behavior of components. However, these components are very diverse, ranging from firewalls and HTTP servers to general purpose machines or even entire network configurations. Fortunately, the intra-update sniping vulnerability is independent from the precise behavior of individual components, and we only need to analyze the network and security behavior of the infrastructure. We only track the security level of requests, and abstract away from their content. To describe our model, we need to introduce three concepts used in IaC:

A component of the infrastructure is called a resource. Every configuration file declares a set of resources and their configurations (e.g. Figure 4). Some resources, like the LambdaExecutionRole and the LambdaPermission, are security resources, and they prevent unauthorized use of other resources. Other resources, like the GreetingLambda and the GreetingRequestGET, are actual running processes, the latter also being publicly accessible. Finally, some resources do not correspond to a running process but to a group of resources, such as GreetingApi, which gives some configuration value to every resource in the group.

A resource's configuration may reference other resources, and we record that information in our model. Based on the CloudFormation documentation, we distinguish different types of references that we list below:


```
CloudFormation File Corresponding Model
{ "Resources": {
"LambdaPermission": { LambdaPermission [security]
 "Type": "AWS::Lambda::Permission", intrinsic security: LambdaPermission,
 "Properties": {
   "FunctionName": "GreetingLambda", connection security(GreetingApi,
   "SourceArn": "GreetingApi" GreetingLambda, this)
 }
},
"GreetingLambda": { GreetingLambda
 "Type": "AWS::Lambda::Function", intrinsic security: 
 "Properties": {
   "Role": "LambdaExecutionRole"
 }
},
"GreetingRequestGET": { GreetingRequestGET [public]
 "Type": "AWS::ApiGateway::Method", intrinsic security: ,
 "Properties": {
   "Integration": "GreetingLambda", network(this, GreetingLambda),
   "RestApiId": "GreetingApi" collects(GreetingApi, this)
 }
},
"GreetingApi": { GreetingApi [collection]
 "Type": "AWS::ApiGateway::Api" intrinsic security: 
},
"LambdaExecutionRole": {
 "Type": "AWS::IAM::Role"
 "Properties": {
   ...
 }}}}
```
Fig. 4: Mapping Between a CloudFormation File and our Model

Each of these reference types can be present in any resource, any number of times. The resource in which a reference is declared can take any role in the relation that it defines, and we represent that resource as this in the model, as shown on the right side of Figure 4.

In CloudFormation, a dependency is declared by using e.g. the DependsOn keyword. A dependency restricts the order in which updates can occur: before a resource can be updated, all the resources it depends on must have been updated.
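As a minimal sketch of how such declarations could be read into a dependency relation, the following Python snippet collects `(resource, dependency)` pairs from a template that has already been parsed into a dictionary. The helper name `dependency_pairs` and the `DependsOn` entries in the sample template are illustrative assumptions; the resource names are borrowed from Figure 4.

```python
def dependency_pairs(template):
    """Collect (resource, dependency) pairs from DependsOn attributes."""
    pairs = set()
    for name, body in template.get("Resources", {}).items():
        depends_on = body.get("DependsOn", [])
        # DependsOn may hold a single logical ID or a list of logical IDs.
        if isinstance(depends_on, str):
            depends_on = [depends_on]
        for dependency in depends_on:
            pairs.add((name, dependency))
    return pairs


# Hypothetical template: resource names follow Figure 4, the DependsOn
# attributes are added here for illustration only.
template = {
    "Resources": {
        "GreetingApi": {"Type": "AWS::ApiGateway::Api"},
        "LambdaExecutionRole": {"Type": "AWS::IAM::Role"},
        "GreetingLambda": {
            "Type": "AWS::Lambda::Function",
            "DependsOn": "LambdaExecutionRole",
        },
        "GreetingRequestGET": {
            "Type": "AWS::ApiGateway::Method",
            "DependsOn": ["GreetingLambda", "GreetingApi"],
        },
    }
}

pairs = dependency_pairs(template)
for resource, dependency in sorted(pairs):
    print(f"{resource} is updated only after {dependency}")
```

Each pair reads "the first resource depends on the second", which is the orientation used for the relation D in the model.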

#### **2.2 Model of a CloudFormation Infrastructure**

We now describe a model for a CloudFormation infrastructure. We define a state S = (R, D) as a set of resources and a partial order that represents the dependency relation between resources. A resource is a tuple composed of a name (string), a type, an intrinsic security context, an origin flag, the different types of references discussed above, and the original configuration of the resource.

With (id, id′) ∈ D we denote that id depends on id′, and that id cannot be upgraded until id′ is upgraded.

The origin flag denotes whether the resource comes from the initial state or the target state during an upgrade, but it is not used at all when dealing with a single state. Similarly, the original configuration's type is not further defined, and depends on the vendor. It is not used for a single deployment, and we only use it to check for equality of resources when updating an existing deployment.

Inspired by Abstract Interpretation [10], we define a security context as an abstract domain with a partial order and some abstract operations: a top, a bottom, a meet, and a join. When two security contexts are comparable (x ⊑ y), we say that x is less permissive than y, or that x is more secure than y.

We define predicates that can help us to express some properties of resources in a specific state S: collection(r), resp. security(r), means that r is a resource whose type is that of a collection resource, resp. a security resource. We use public(r) to denote when r is a resource whose type is that of a resource that can be accessed from anywhere on the internet (although this might be restricted with security references), or if it is contained in a collection that is itself publicly accessible.

**Definition 1 (connection).** A connection is possible between two resources when there is a network reference between them or between resources that collect them.

$$ref(r, r') \iff \exists c, c'.\ \land \begin{cases} \textit{network reference}(c, c') \\ r = c \lor collects(c, r) \\ r' = c' \lor collects(c', r') \end{cases}$$

The security of a connection is the minimum security level a request from r must have to be able to reach r′ directly. This definition reflects the fact that, when a connection is secured by multiple security resources, a request must have sufficient authority to be accepted by all of them.

#### **Definition 2 (connection security).**

$$security(r, r') = \sqcap \left\{ s \;\middle|\; \exists c, c'.\ \lor \begin{cases} \textit{incoming protection}(c, s) \\ \textit{outgoing protection}(c', s) \\ \textit{connection protection}(c, c', s) \end{cases} \text{ with } \land \begin{cases} r = c \lor collects(c, r) \\ r' = c' \lor collects(c', r') \end{cases} \right\}$$
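Definitions 1 and 2 can be illustrated with a small Python sketch. It makes two simplifying assumptions that are not part of the paper: a security context is encoded as the set of guards a request must satisfy (so the meet is set union and ⊤, no protection, is the empty set), and only connection protections are modelled. All function and variable names are hypothetical.

```python
TOP = frozenset()  # assumed encoding of "no protection at all"


def meet(contexts):
    """Meet of security contexts: every listed guard must be satisfied."""
    result = TOP
    for context in contexts:
        result = result | context
    return result


def covers(collects, c, r):
    """True when c is r itself or a collection that contains r."""
    return c == r or (c, r) in collects


def ref(network, collects, r, r2):
    """Definition 1: some network reference (c, c2) covers both r and r2."""
    return any(covers(collects, c, r) and covers(collects, c2, r2)
               for (c, c2) in network)


def security(protections, collects, r, r2):
    """Definition 2, restricted to connection protections (c, c2, s)."""
    return meet(s for (c, c2, s) in protections
                if covers(collects, c, r) and covers(collects, c2, r2))


# The references shown on the right side of Figure 4.
network = {("GreetingRequestGET", "GreetingLambda")}
collects = {("GreetingApi", "GreetingRequestGET")}
protections = [
    ("GreetingApi", "GreetingLambda", frozenset({"LambdaPermission"})),
]

print(ref(network, collects, "GreetingRequestGET", "GreetingLambda"))
print(security(protections, collects, "GreetingRequestGET", "GreetingLambda"))
```

Because GreetingApi collects GreetingRequestGET, the protection declared on the collection also applies to the connection from the GET method to the lambda.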

#### **2.3 Execution Semantics**

The execution semantics for our intermediate representation is given below. The semantics explains which resources are allowed to talk to which resources, and under which security level. When we write L ⊢ r → r′, it means that r is allowed to send a request to r′ under the security level L.

A request can come from the internet (represented by the constant W) and reach a public resource r′ if it has a sufficient security level L. Similarly, a request can come from a resource r and reach r′ if it has a sufficient security level, r′ is not a collection, and both resources have an adequate configuration that allows them to communicate.

$$\text{OutsideRequest}\ \frac{r' \in R \quad \neg collection(r') \quad L \sqsubseteq security(W, r') \quad public(r')}{L \vdash W \rightarrow r'}$$

$$\text{InternalRequest}\ \frac{(r, r') \in R^2 \quad \neg collection(r') \quad L \sqsubseteq security(r, r') \quad ref(r, r')}{L \vdash r \rightarrow r'}$$

A path P is a finite sequence of resources whose first resource is public, and subsequent resources can be reached from the previous, using the above semantics under some security level. The security of a path is then defined as the minimal security level under which every node can be reached in the above semantics:

$$security((r_1, \ldots, r_n)) = \bigwedge_{i=1}^n security(r_{i-1}, r_i)$$

with r<sub>0</sub> = W. We write W →<sup>∗</sup> r for the set of paths whose last element is r. Similarly, the security of a node is defined as the minimal security level under which the node can be reached by at least one path:

$$Sec(r) = \bigvee \{ security(P) \mid P \in W \to^* r \}$$

When the infrastructure under which we consider the security of resources is not clear from the context, we indicate it with a subscript, as in Sec<sub>S</sub>(r).
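The two formulas above can be sketched in Python under an assumed encoding: a security context is the set of guards a request must satisfy, so the meet along a path is set union, the join across paths is set intersection, and an unreachable node (⊥) is represented by `None`. The `edges` dictionary is assumed to already contain only the requests permitted by the two rules; the resource names follow Figure 3, and `AuthGet` is a hypothetical guarded entry point.

```python
W = "W"  # the outside world


def simple_paths(edges, target):
    """Enumerate cycle-free paths from W to target, as lists of edges."""
    found = []

    def walk(node, visited, trail):
        if node == target:
            found.append(list(trail))
            return
        for (src, dst) in edges:
            if src == node and dst not in visited:
                trail.append((src, dst))
                walk(dst, visited | {dst}, trail)
                trail.pop()

    walk(W, {W}, [])
    return found


def path_security(edges, path):
    """Meet over the connections of a path: guards accumulate."""
    context = frozenset()
    for edge in path:
        context = context | edges[edge]
    return context


def sec(edges, target):
    """Join over all paths: the least protected path decides."""
    paths = simple_paths(edges, target)
    if not paths:
        return None  # assumed encoding of bottom: nobody can reach the node
    result = path_security(edges, paths[0])
    for path in paths[1:]:
        result = result & path_security(edges, path)
    return result


guarded = {
    (W, "AuthGet"): frozenset({"Authorizer"}),
    ("AuthGet", "PrivateLambda"): frozenset(),
}
exposed = dict(guarded)
exposed[(W, "PublicGet")] = frozenset()
exposed[("PublicGet", "PrivateLambda")] = frozenset()

print(sec(guarded, "PrivateLambda"))
print(sec(exposed, "PrivateLambda"))
```

Restricting the search to cycle-free paths loses nothing in this encoding: a path that revisits a node only accumulates more guards, so it can never be the least protected one.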

**Definition 3 (Substate).** When comparing two states, S<sub>1</sub> and S<sub>2</sub>, we say that S<sub>1</sub> ⊆ S<sub>2</sub> when


Our first lemma states that, when a state is a substate of another, its nodes are at least as secure as the other.

#### **Lemma 1 (Substate Security).** ∀S<sub>1</sub>, S<sub>2</sub>. ∀id ∈ S<sub>1</sub>. S<sub>1</sub> ⊆ S<sub>2</sub> ⟹ Sec<sub>S<sub>1</sub></sub>(id) ⊑ Sec<sub>S<sub>2</sub></sub>(id)

Proof. We note that by definition, id is in both states. Additionally, any path in S<sub>1</sub> is also a path in S<sub>2</sub>, and since connections in S<sub>1</sub> are at least as secure as the same connections in S<sub>2</sub>, paths in S<sub>1</sub> are at least as secure as the same paths in S<sub>2</sub>.

The security of a node is the meet of the security of paths that lead to it in the state. Paths that lead to id in S<sub>2</sub> are the paths that lead to it in S<sub>1</sub>, and potentially additional paths. Therefore, id is at least as secure in S<sub>1</sub> as in S<sub>2</sub>.

#### **2.4 Upgrade Semantics and Security Policy**

In IaC tools, an upgrade changes a given infrastructure state to a new state. This is done by upgrading each node that needs to be changed as specified by the new configuration. Generally, nodes are upgraded in an unspecified order, even in parallel, to improve deployment speed. Node updates are sent asynchronously to every service that needs to be updated, and there are dozens if not hundreds of steps each service must take to complete its update. When these upgrades are sent in parallel, it is difficult to reason about the state of the system as the running time for a node upgrade depends on the latency of the service. To model this behavior, we define an interleaving semantics for upgrades.

An upgrade starts in an initial state S<sub>i</sub> and ends in a target state S<sub>t</sub>. Additional dependency ordering information is provided by the relation D of the target state.

The configuration of an identifier can be updated if all its dependencies are already updated (∀id′, (id, id′) ∈ R ⟹ S(id′) = S<sub>t</sub>(id′)), and it has not been updated yet:

$$\text{UpgradeConf}\ \frac{S(id) \neq S_t(id) \quad \forall id', (id, id') \in R \implies S(id') = S_t(id')}{S \to S[id \gets S_t(id)]}$$

A new resource can be created under the same conditions, if it was not present in the initial state:

$$\text{UpgradeAdd}\ \frac{id \notin S \quad \forall id', R(id, id') \implies S(id') = S_t(id')}{S \to S[id \gets S_t(id)]}$$

An identifier can be removed, if it is not in the target state:

$$\text{UpgradeDel}\ \frac{id \notin S_t \quad id \in S}{S \to S \setminus id}$$

We collect every accessible intermediate state in a set denoted by Acc:

$$\text{AccInit}\ \frac{}{S_i \in Acc} \qquad \text{AccNext}\ \frac{S \in Acc \quad S \to S'}{S' \in Acc}$$

Note that, in the absence of any dependency, Acc contains every combination where each resource is either at its initial or target configuration, leading to 2<sup>n</sup> possible intermediate states when n is the number of changed resources.
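The interleaving semantics can be explored with a short Python sketch that computes Acc as the closure of the initial state under the three upgrade rules. States are encoded as dictionaries from identifiers to opaque configuration values, and a dependency pair `(a, b)` means that `a` depends on `b`; these encodings and the example configurations are assumptions made for illustration.

```python
def successors(state, target, deps):
    """Yield every state reachable in one upgrade step."""
    current = dict(state)
    for rid in sorted(set(current) | set(target)):
        if rid not in target:
            # Deletion: the identifier is absent from the target state.
            nxt = dict(current)
            del nxt[rid]
            yield frozenset(nxt.items())
        elif current.get(rid) != target[rid]:
            # Update or creation: all dependencies must already be at target.
            ready = all(current.get(dep) == target.get(dep)
                        for (res, dep) in deps if res == rid)
            if ready:
                nxt = dict(current)
                nxt[rid] = target[rid]
                yield frozenset(nxt.items())


def accessible(initial, target, deps=frozenset()):
    """Closure of the initial state under upgrade steps (the set Acc)."""
    start = frozenset(initial.items())
    seen = {start}
    frontier = [start]
    while frontier:
        state = frontier.pop()
        for nxt in successors(state, target, deps):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


initial = {"Get": "public", "Lambda": "v1"}
target = {"Get": "authorized", "Lambda": "v2"}

free = accessible(initial, target)
ordered = accessible(initial, target, deps={("Lambda", "Get")})
print(len(free), len(ordered))
```

With two changed resources and no dependency the sketch finds 2<sup>2</sup> = 4 states; declaring that Lambda depends on Get removes the state in which Lambda is already upgraded while Get is not.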

We next show that, when two identifiers are in a dependency relation, some intermediate states are not possible. For ease of expressing this lemma, we extend equality to also check whether id is in the domain of S. If id is neither in S nor in S′, we have S(id) = S′(id). Otherwise, id must be in both and associated to the same configuration for the equality to hold.

#### **Lemma 2 (Dependency Restriction).**

∀(id, id′) ∈ R, S ∈ Acc ⟹ S(id) = S<sub>t</sub>(id) ∨ S(id′) = S<sub>i</sub>(id′) ∨ S<sub>t</sub>(id) = S<sub>i</sub>(id) ∨ S<sub>t</sub>(id′) = S<sub>i</sub>(id′)

Proof. By induction on S ∈ Acc and by case analysis on the inequality that holds in the inductive case.

We now define the security policy as:

**Definition 4 (Security Policy).** A deployment from S<sub>i</sub> to S<sub>t</sub> is secure iff:

$$\forall S \in Acc, \forall id, \begin{cases} Sec(S, id) \sqsubseteq Sec(S_i, id) & \text{if } S_i(id) = S(id) \\ Sec(S, id) \sqsubseteq Sec(S_t, id) & \text{if } S_t(id) = S(id) \\ Sec(S, id) = \bot & \text{otherwise (id is not in } S) \end{cases}$$

Our work focuses on security issues that happen during upgrades, assuming that the initial and target states are both secure. We require that, in any intermediate state, every resource is at least as secure as its counterpart in the initial or target state, depending on where its configuration comes from.
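To make the policy concrete, here is a hedged Python sketch that checks the first two cases of Definition 4 over an explicitly listed set of intermediate states. It relies on a toy model that is purely an assumption: a security context is the set of guards a request must satisfy (so x ⊑ y holds when x contains y), and every resource is reached through a single entry point that is guarded exactly when a resource named `Authorizer` is deployed. The third case (identifiers absent from the state) is not modelled.

```python
def leq(x, y):
    """x ⊑ y: x demands at least every guard that y demands."""
    return x >= y


def sec(state, rid):
    # Toy model (assumption): the authorizer, once deployed, guards the only
    # path from the web, so every resource shares the same security context.
    if "Authorizer" in state:
        return frozenset({"Authorizer"})
    return frozenset()


def violations(acc, initial, target):
    """Resources whose intermediate security is below their reference."""
    found = []
    for state in acc:
        for rid, conf in state.items():
            if initial.get(rid) == conf:
                reference = sec(initial, rid)   # configuration from S_i
            else:
                reference = sec(target, rid)    # configuration from S_t
            if not leq(sec(state, rid), reference):
                found.append((rid, conf))
    return found


initial = {"Get": "v1", "Lambda": "v1"}
target = {"Get": "v1", "Lambda": "v2", "Authorizer": "v1"}

# Intermediate states without any dependency...
acc_free = [
    initial,
    {"Get": "v1", "Lambda": "v2"},
    {"Get": "v1", "Lambda": "v1", "Authorizer": "v1"},
    target,
]
# ...and when Lambda is declared to depend on Authorizer.
acc_ordered = [acc_free[0], acc_free[2], acc_free[3]]

print(violations(acc_free, initial, target))
print(violations(acc_ordered, initial, target))
```

The only offending state is the one where the new lambda is already deployed while the authorizer is still missing, which is the situation a dependency between the two resources rules out.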

#### **3 Architectural Design of the Häyhä Tool**

#### **3.1 Upgrade States**

To verify the security of intermediate states, we could compute all the possible intermediate states and pass them to existing tools that could check the security of such states. However, this approach has two main drawbacks. First, we would need to construct 2<sup>n</sup> intermediate states, which does not scale for large infrastructure changes. Second, the result of such tools would not be easy to understand for end users, as they would report issues with states that are not defined or even considered by the user. Our goal is a tool that is both scalable and able to provide suggestions on how to change the target configuration, not some hidden intermediate configuration.

Fig. 5: Example Upgrade State

To address scalability we introduce upgrade states, which represent multiple states on which we can apply the same execution semantics. Recall that a state is composed of a list of resources with their origin, type and references, and of a dependency relation. An upgrade state is composed in the same way. The set of resources is the union of the resources from the initial and target states, excluding initial resources that only differ from their target counterpart by their provenance flag. When resources are added or removed from an infrastructure, we introduce an empty resource for each of them. They represent the absence of these resources. The dependency relation of the upgrade state is the dependency relation of the target state.

The execution semantics of an upgrade state is the same as the execution semantics of a normal state. Since the upgrade state represents multiple versions of the same resources at the same time, we need to change the definition of the security level of a connection between resources. An example of an upgrade state is given in Figure 5. The initial state has an API, a GET method and a lambda, and everything is public. The target state modifies the lambda and adds an authorizer. The upgrade state comprises the unchanged API, the target authorizer (with an empty resource as its initial counterpart), the GET method (which did not change), and the two variants of the lambda. The connection to the GET method is protected either by the empty node (⊤) or the target authorizer. The minimal security level for this connection is therefore ⊤.

In summary, when a security resource is relevant for a connection, we need to consider its counterpart that has a different provenance flag. If it is also relevant, the connection is protected by the disjunction of the security level of these resources (they cannot both exist at the same time, but one of them exists at any given time). If it is not relevant, the upgrade state represents at least one case where the security resource is not relevant, meaning that the connection is protected by the disjunction of the first security level and ⊤, which is ⊤ (no security at all). If the counterpart is an empty resource, the upgrade state represents at least one case where the security resource was deleted (or not yet added), so the connection is also unprotected. If there is no counterpart, the connection is simply protected by the resource, because it does not change in any way during the upgrade.
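The four cases of this summary can be written down as a small Python sketch. It assumes, for illustration only, that a security context is the set of guards a request must satisfy, so the disjunction (join) is set intersection and the unprotected level ⊤ is the empty set; the `Counterpart` markers and the function name are hypothetical.

```python
from enum import Enum

TOP = frozenset()  # assumed encoding of "no security at all"


class Counterpart(Enum):
    NONE = "no counterpart"
    EMPTY = "empty resource"
    IRRELEVANT = "counterpart not relevant for this connection"


def join(x, y):
    """Disjunction of two contexts: only guards required by both remain."""
    return x & y


def upgrade_protection(level, counterpart):
    """Protection a security resource offers a connection in an upgrade state."""
    if counterpart is Counterpart.NONE:
        return level  # the resource does not change during the upgrade
    if counterpart in (Counterpart.EMPTY, Counterpart.IRRELEVANT):
        return join(level, TOP)  # at some point the guard is not in place
    return join(level, counterpart)  # either variant may be the active one


authorizer = frozenset({"Authorizer"})
print(upgrade_protection(authorizer, Counterpart.EMPTY))
print(upgrade_protection(authorizer, Counterpart.NONE))
```

The first call corresponds to the situation of Figure 5, where the target authorizer has an empty initial counterpart and the connection to the GET method therefore ends up unprotected.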

We denote by U(S<sub>i</sub>, S<sub>t</sub>) the upgrade state created from the initial state S<sub>i</sub> and the target state S<sub>t</sub>. We now show that this state indeed collects all possible intermediate states.

#### **Lemma 3 (Upgrade Graph is an Overapproximation).**

∀S ∈ Acc. S ⊆ U(S<sub>i</sub>, S<sub>t</sub>)

Proof. To apply the definition, we first show resources of S are resources of U. Then, we show that any connection in S is a connection in U, because resources come with the same references in both states.

#### **3.2 Splitting Dependencies**

We have seen that the upgrade state created from the initial and target configurations is an over-approximation of all the intermediate states, when we do not consider dependencies. Because dependencies reduce the number of intermediate states, the upgrade state might not be precise enough and might produce a warning when no actual intermediate states violate the security policy.

**Variants**. When the state has two nodes A and A′ with the same identifier, but a different label, we call them variants of one another. When A belongs to the initial configuration and A′ to the target configuration, (A, A′) is called an upgrade pair.

We refine the upgrade state by splitting it along a dependency. Considering a state S, its dependency relation D, and two target resources (A′, B′) ∈ D, the split of S, split(S, A′, B′), is a set of upgrade states. Suppose A′ and B′ are, respectively, part of the upgrade pairs (A, A′) and (B, B′). Then split(S, A′, B′) is the set of three upgrade states in which only one of A or A′ remains, and only one of B or B′. We exclude the case where A′ and B remain. When any of these nodes does not exist, the number of possible combinations is reduced. When only A′ and B exist in S, we have found an impossible situation, and the result of splitting is the empty set.

Although this process creates an exponential number of states, the number of dependencies tends to be limited in practice, because they slow upgrades down. At the same time, a large number of dependencies actually reduces the number of possible intermediate states, until every node is in a dependency, in which case there are exactly n intermediate states.
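A minimal Python sketch of the splitting step follows. It assumes an upgrade state is encoded as a dictionary from identifiers to the set of variants still present (`"initial"`, `"target"`), and that the excluded combination is the one the upgrade rules forbid: the dependent resource already upgraded while its dependency is still at its initial configuration. The encoding and names are illustrative, not the tool's data structures.

```python
from itertools import product


def split(upgrade_state, a, b):
    """Split an upgrade state along the dependency 'a depends on b'."""
    results = []
    for va, vb in product(sorted(upgrade_state[a]), sorted(upgrade_state[b])):
        if (va, vb) == ("target", "initial"):
            continue  # a cannot be upgraded before b
        refined = dict(upgrade_state)
        refined[a] = frozenset({va})
        refined[b] = frozenset({vb})
        results.append(refined)
    return results


both = frozenset({"initial", "target"})
state = {"Lambda": both, "Authorizer": both, "Api": frozenset({"initial"})}

for refined in split(state, "Lambda", "Authorizer"):
    print(sorted(refined["Lambda"]), sorted(refined["Authorizer"]))

impossible = {
    "Lambda": frozenset({"target"}),
    "Authorizer": frozenset({"initial"}),
}
print(split(impossible, "Lambda", "Authorizer"))
```

When both resources have two variants the split yields three refined states, and when only the forbidden combination is present it yields the empty list, mirroring the two situations described above.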

We now prove that splitting the upgrade state is correct, in the sense that the set of states split(S) still contains all the possible intermediate states (Acc):

#### **Theorem 1 (Correct Split).**

∀S ∈ Acc. ∃u ∈ split(U(S<sub>i</sub>, S<sub>t</sub>)). S ⊆ u

Proof. Let us take a state S ∈ Acc from the set of all possible intermediate states. Since splitting a state according to a dependency preserves the states from Acc (Lemma 4 below), we can consider every dependency and split them in any order. Initially, it holds that S ⊆ U(S<sub>i</sub>, S<sub>t</sub>), using Lemma 3.

Consider an upgrade state u such that S ⊆ u and D(id, id′). By Lemma 4, we can find a state u′ ∈ split(u, id, id′) such that S ⊆ u′.

After applying this for each dependency, u is one of the states resulting from split(U(S<sub>i</sub>, S<sub>t</sub>)), and the claim of the theorem holds.

The following intermediate lemma is needed to prove the correctness of the split. It states that if a state contains one of the accessible states, splitting a dependency in it results in a set of states, one of which still contains this intermediate state.

**Lemma 4 (Split Graphs).** ∀S ∈ Acc. ∀(id, id′) ∈ D. S ⊆ u ⟹ ∃u′ ∈ split(u, id, id′), S ⊆ u′

Proof. Take (A, A′) the upgrade pair whose identifier is id. Similarly, take (B, B′) the upgrade pair whose identifier is id′. Since S ∈ Acc, A′ and B cannot both exist at the same time in S (Lemma 2). Since S ⊆ u, we also know that u has at least one variant of id and one variant of id′, the ones that are present in S.

The states from split(u, id, id′) are composed of the same nodes as u, except for id and id′, where they all have one of the four possible combinations of initial and target states, except for the pair A′, B. Since S does not have them both either, one state has the same variants of id and id′ as S, and we call it u′. We now show that S ⊆ u′.

First, we note that u′ has the same nodes as u, except for those with identifier id and id′. For any resource in S, the resource was present in u, so it is also in u′, unless it has identifier id or id′. For these last cases, we note that u′ is defined to contain the same variants as S, so the resources of S are also resources of u′.

Second, if we take L ⊢ r → r′ in S, we can use the same reasoning as in Lemma 3 to conclude that it also holds in u′. Thus we conclude that S ⊆ u′.

#### **3.3 Finding Vulnerabilities**

After Häyhä constructs the upgrade state, the next step is to check for security issues. Although we could split the upgrade state recursively until no dependency remains, a more interesting strategy is to immediately check the upgrade state for issues. If none is found, it is not necessary to refine the upgrade state. Otherwise, we try to find a relevant dependency and split the upgrade state on it, running the analysis on the resulting states, splitting on other dependencies as needed.
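The check-then-refine strategy can be sketched in Python as a recursive function. This is a simplified illustration under stated assumptions, not the tool's implementation: upgrade states map identifiers to the set of variants present, dependencies are taken in the order given rather than selected for relevance, and `exposed` is a hypothetical issue check for a lambda that must sit behind an authorizer.

```python
from itertools import product


def split(state, a, b):
    """Refine a state along 'a depends on b', dropping the forbidden case."""
    refined_states = []
    for va, vb in product(sorted(state[a]), sorted(state[b])):
        if (va, vb) != ("target", "initial"):
            refined = dict(state)
            refined[a] = frozenset({va})
            refined[b] = frozenset({vb})
            refined_states.append(refined)
    return refined_states


def analyse(state, deps, has_issue):
    """Return the refined upgrade states that still exhibit an issue."""
    if not has_issue(state):
        return []  # nothing to report, no refinement needed
    if not deps:
        return [state]  # cannot refine further: raise an alarm
    (a, b), remaining = deps[0], deps[1:]
    alarms = []
    for refined in split(state, a, b):
        alarms.extend(analyse(refined, remaining, has_issue))
    return alarms


def exposed(state):
    # Hypothetical check: the new lambda may run while the authorizer is absent.
    return "target" in state["Lambda"] and "initial" in state["Authorizer"]


both = frozenset({"initial", "target"})
upgrade_state = {"Lambda": both, "Authorizer": both}

print(len(analyse(upgrade_state, [], exposed)))
print(len(analyse(upgrade_state, [("Lambda", "Authorizer")], exposed)))
```

Without dependencies the coarse upgrade state raises one alarm; once the dependency of the lambda on the authorizer is available for splitting, none of the refined states is exposed and the alarm disappears.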

Our analysis detects two types of issues: first, if an empty node is accessible, it might be used by the infrastructure at a point when it is not registered by the owner of the infrastructure. This is the case for a new node that is accessible before it is created. When that node is a resource that can be claimed by a third party (such as an S3 bucket), the attacker might be able to register it before the user. Similarly, for a deleted resource, an attacker could register it for themselves before the user stops using it.

Second, the security context of every node in the upgrade state is compared to the security of the same node in the initial or target state (depending on its provenance flag). When its security is strictly lower than the security of the node in the state it comes from, or incomparable, we raise an alarm because there is an intermediate step where the resource might not be sufficiently protected.

Using Lemma 1 and Theorem 1, when the security of a node in a possible intermediate state (collected in Acc) is insufficient, the security of that node in at least one split upgrade state is even lower. Therefore, if there is a violation of the security property, our tool will detect it.

#### **4 Experiments**

Häyhä is designed to be used before the deployment of a CloudFormation update, and it is crucial that Häyhä does not interrupt developer workflow. Our goal was, therefore, to evaluate the scalability of Häyhä on a variety of real-world CloudFormation updates. To do this, we collected 36 CloudFormation files from GitHub, where each file had a history of updates (commits). We ran Häyhä against every update recorded in GitHub to that file, and measured the running time. We found that our analysis completed within one second for all files. We believe that these results indicate that Häyhä could be integrated into developer workflows with minimal disruption to the user. The details of the evaluation dataset are given in Fig. 6.

Fig. 6: Analysis time of various CloudFormation files from GitHub. Point size is proportional to the number of updated resources, which are between 0 and 31.

To collect the set of GitHub CloudFormation files used in our scalability benchmark, we searched GitHub using the web search tool for code with the keyword AWSTemplateFormatVersion, which is a required keyword for any CloudFormation file. We then filtered by the .yaml extension, and further manually filtered for valid CloudFormation files (as opposed to other languages with overlap). Since we wanted to track updates to these files, we also filtered manually to find only files that had a revision history (≥ 2 commits for the file).

While we showed that Häyhä scales well on real-world data, we did not identify any instances of intra-update sniping vulnerability in these files. This is an expected result, as the CloudFormation files we found on GitHub were generally designed as templates that developers would customize to their own needs. We believe application-focused CloudFormation files are not often uploaded, since CloudFormation files can contain sensitive and proprietary information (e.g. infrastructure design). In order to run a large-scale analysis to check for past instances of intra-update sniping vulnerability, we would need access to a repository of the private user data for many CloudFormation users.

#### **5 Related Work**

Following the development and use of Infrastructure as Code (IaC) practices, many threats and security challenges were recognized [26,27]. The security risks that have been identified in IaC have thus far remained similar to existing vulnerabilities arising from poor security practices, such as infrequent key rotation and hard-coded secret values [25]. Additionally, despite existing recommendations and good practices when dealing with cloud infrastructure, many existing deployments are still left insecure by user misconfigurations. For example, storage "buckets" that host files should generally be configured by users to disallow world readable/writable permissions. However, in practice, users struggle with this [8]. Existing work has used SMT solvers to automatically detect such vulnerabilities and help users secure their resources [4,9]. In contrast, we focus on the dynamic behavior of deployment updates that occur when using IaC tools, and their effect on security configuration.

Much work has focused on the security of virtualization technologies based on attack models ranging from malicious cloud users to compromised cloud providers, as summarized in [13]. In our work, however, we do not make any assumption on the specific technology, as intra-update sniping vulnerabilities rely mostly on timing and insecure configuration on the user's side.

Our work is based on a graph model of the dataflow network of resources created in an infrastructure configuration. Similarly, Al-Shaer et al. [2] propose to model and check network security using a graph-based model of the network. As with other work on network and infrastructure security [5,18], the focus of the analysis is on the security of static network topologies, instead of the security of a moving topology, as we have in this paper. The analysis of security in static networks and static information flow models [21] is complementary to our work, as we assume the initial and target infrastructure are secure.

Beyond network configurations, there has been work in the analysis of configuration files. In particular, static analysis has been used to check that IaC configurations are idempotent [14,30], an important property for maintaining reproducibility of infrastructure. The reproducibility of infrastructure is known to be a challenge [7], despite IaC being declarative and version controlled. Further efforts have used probabilistic modelling to learn constraints on configurations [22,28,29].

#### **6 Conclusion**

We have identified a new class of vulnerability that applies to Infrastructure as Code services, intra-update sniping vulnerabilities, which arise from a lack of ordering in upgrading resources. We presented a tool, Häyhä, that detects such vulnerabilities in CloudFormation and gives feedback to users on how to securely update their infrastructure deployment. Our evaluation shows the scalability of Häyhä by running it on existing configurations from GitHub, and we found that it runs quickly enough to be usable in practice.

#### **Acknowledgement**

This work was completed while working on the grant supported by the National Science Foundation under Grant No. CCF-1715387, and partially supported by the Office of Naval Research under Grant N00014-17-1-2787.

#### **References**


Interest Group on Data Communication. SIGCOMM '18, Association for Computing Machinery, New York, NY, USA (2018)


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

### **Proof Generation/Validation**

#### **Certifying Proofs in the First-Order Theory of Rewriting**<sup>⋆</sup>

Fabian Mitterwallner<sup>1</sup> (✉), Alexander Lochmann<sup>1</sup>, Aart Middeldorp<sup>1</sup>, and Bertram Felgenhauer<sup>2</sup>


<sup>1</sup> Department of Computer Science, University of Innsbruck, Innsbruck, Austria
fabian.mitterwallner@uibk.ac.at, alexander.lochmann@uibk.ac.at, aart.middeldorp@uibk.ac.at

<sup>2</sup> Innsbruck, Austria
int-e@gmx.de

**Abstract.** The first-order theory of rewriting is a decidable theory for linear variable-separated rewrite systems. The decision procedure is based on tree automata techniques and recently we completed a formalization in the Isabelle proof assistant. In this paper we present a certificate language that enables the output of software tools implementing the decision procedure to be formally verified. To show the feasibility of this approach, we present FORT-h, a reincarnation of the decision tool FORT with certifiable output, and the formally verified certifier FORTify.

#### **1 Introduction**

Many properties of rewrite systems can be expressed as logical formulas in the first-order theory of rewriting. This theory is decidable for the class of linear variable-separated rewrite systems, which includes all ground rewrite systems. The decision procedure is based on tree automata techniques and goes back to Dauchet and Tison [7]. It is implemented in FORT [17,18]. FORT takes as input one or more rewrite systems R<sub>0</sub>, R<sub>1</sub>, ... and a formula ϕ, determines whether or not the rewrite systems satisfy the property expressed by ϕ, and reports yes or no accordingly. FORT may not reach a conclusion due to limited resources.

For properties related to confluence and termination, designated competitions (CoCo [15], termCOMP [9]) of software tools take place regularly. Occasionally, yes/no conflicts appear. Since the participating tools typically couple a plethora of techniques with sophisticated search strategies, human inspection of the output of tools to determine the correct answer is often not feasible. Hence certified categories were created in which tools must output a formal certificate. This certificate is verified by CeTA [21], an automatically generated Haskell program using the code generation feature of Isabelle. This requires not only that the underlying techniques are formalized in Isabelle, but the formalization must be executable for code generation to apply. During the time-consuming formalization process, mistakes in papers are sometimes brought to light.

<sup>⋆</sup> This research is supported by FWF (Austrian Science Fund) project P30301.

© The Author(s) 2021

J. F. Groote and K. G. Larsen (Eds.): TACAS 2021, LNCS 12652, pp. 127–144, 2021. https://doi.org/10.1007/978-3-030-72013-1_7

Since 2017 we have been concerned with the question of how to ensure the correctness of the answers produced by FORT. The certifier CeTA supports a great many techniques for establishing concrete properties like termination and confluence, but the formalizations in the underlying Isabelle Formalization of Rewriting (IsaFoR)<sup>3</sup> are orthogonal to the ones required for supporting the decision procedure underlying FORT. We recently completed the formalization of the automata constructions involved in the decision procedure [14]. Earlier fragments were described in [8, 13]. In this paper we put these efforts to the test. More precisely, we


The remainder of the paper is organized as follows. The next section briefly recapitulates the first-order theory of rewriting and the variant of the decision procedure described in [14]. Sections 3 and 4 describe the representation of formulas in certificates and the certificate language. In Section 5 we describe how certificates are validated by FORTify, the verified Haskell program obtained from the Isabelle formalization. Section 6 describes FORT-h. Experimental results are presented in Section 7, before we conclude in Section 8.

#### **2 Preliminaries**

Familiarity with term rewriting [2] and tree automata [6] is useful, but we briefly recall important definitions and notation that we use in the remainder.

Terms T (F, V) are constructed from a signature F, consisting of function symbols with fixed arities, and a set of variables V. A term rewrite system (TRS for short) R consists of rewrite rules ℓ → r between terms ℓ and r. Instead of the usual restrictions ℓ ∉ V and Var(r) ⊆ Var(ℓ), we require Var(ℓ) ∩ Var(r) = ∅. Here Var(t) denotes the set of variables in a term t. Moreover, ℓ and r are assumed to be linear terms (i.e., variables occur at most once). The conditions on the rewrite rules are necessary to ensure decidability of the first-order theory of rewriting for these linear variable-separated TRSs. The (one-step) rewrite relation of a TRS R is denoted by →<sup>R</sup>. A term t is ground if Var(t) = ∅. The set of ground terms is denoted by T (F).
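The two syntactic conditions on rewrite rules are easy to test mechanically. The following Python sketch illustrates them; the term representation (variables as strings, function applications as tuples) and all function names are assumptions made for this illustration and are not taken from FORT or its formalization.

```python
# Assumed representation: a variable is a str, a function application is a
# tuple (symbol, arg_1, ..., arg_n); constants are 1-tuples such as ("a",).

def variables(term):
    """Return the list of variable occurrences in a term (with repetitions)."""
    if isinstance(term, str):
        return [term]
    return [v for arg in term[1:] for v in variables(arg)]


def is_linear(term):
    """A term is linear if no variable occurs more than once."""
    occurrences = variables(term)
    return len(occurrences) == len(set(occurrences))


def is_linear_variable_separated(rules):
    """Check that every rule (l, r) has linear sides and Var(l) ∩ Var(r) = ∅."""
    return all(
        is_linear(l) and is_linear(r)
        and not (set(variables(l)) & set(variables(r)))
        for l, r in rules
    )


# The rules g(g(x)) -> g(y) and a -> g(a) satisfy both conditions.
separated = [(("g", ("g", "x")), ("g", "y")), (("a",), ("g", ("a",)))]
# The rule f(x) -> g(x) is linear but shares x between both sides.
shared = [(("f", "x"), ("g", "x"))]
```

Note that the usual requirement Var(r) ⊆ Var(ℓ) is deliberately not checked, in line with the definition above.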

<sup>3</sup> http://cl-informatik.uibk.ac.at/software/ceta/

The first-order theory of rewriting is defined over a language L containing the predicate symbols →, →<sup>∗</sup>, =, and many more. As models, we consider finite linear variable-separated TRSs R over signatures F such that T (F) is nonempty. The set T (F) serves as domain for the variables in formulas over L. The interpretation of the predicate symbol → in R is the one-step rewrite relation →<sup>R</sup> over T (F), →<sup>∗</sup> denotes the restriction of →<sup>∗</sup><sub>R</sub> to terms in T (F), and = is interpreted as the identity relation on T (F). Since we use ground terms as carrier, formulas in the first-order theory of rewriting express properties on ground terms. For instance, the following formula ϕ expresses the property of having unique normal forms (UNR):

$$\forall s \forall t \forall u \left( s \to^\* t \land \neg \exists \ v \left( t \to v \right) \land s \to^\* u \land \neg \exists \ v \left( u \to v \right) \implies t = u \right)$$

To use ϕ for establishing UNR for arbitrary terms (i.e., terms in T (F, V)) two additional constant symbols need to be added to the signature [18]. (More on this in Section 8.) Additional predicates in L increase the expressive power and also allow expressing properties more compactly. For instance, we can write NF(t) for ¬∃ v (t → v) and s →<sup>!</sup> t for s →<sup>∗</sup> t ∧ ¬∃ v (t → v). In Section 3 we present a grammar that describes the available constructions for predicates. All predicates that can be represented using these constructions are supported in our decision procedure.

The decision procedure is based on tree automata that recognize relations on ground terms. Here we give a brief summary. More information can be found in [6] and [14]. A tree automaton A = (F, Q, Q<sup>f</sup>, Δ) consists of a finite signature F, a finite set Q of states, disjoint from F, a subset Q<sup>f</sup> ⊆ Q of final states, and a set of transition rules Δ. Transition rules have one of the following two shapes: f(p1,...,pn) → q with f ∈ F and p1,...,pn, q ∈ Q, or p → q with p, q ∈ Q. The latter are called epsilon transitions. Transition rules can be viewed as rewrite rules between ground terms in T (F ∪ Q). The induced rewrite relation is denoted by →<sup>Δ</sup> or →<sup>A</sup>. A ground term t ∈ T (F) is accepted by A if t →<sup>∗</sup><sub>Δ</sub> q for some q ∈ Q<sup>f</sup>. The set of all accepted terms is denoted by L(A) and a set L of ground terms is regular if L = L(A) for some tree automaton A.
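Acceptance by a tree automaton can be computed bottom-up by collecting, for every subterm, the set of states it can be rewritten to. The following Python sketch illustrates this under assumed representations: ground terms are tuples, `delta` is a dictionary for the non-epsilon rules, and `eps` is a set of state pairs. The example automaton (terms over a and g with an even number of g's) is hypothetical.

```python
from itertools import product


def reachable_states(term, delta, eps):
    """Return all states q with term ->*_Δ q.

    term  : tuple (symbol, child_1, ..., child_n)
    delta : dict mapping (symbol, (p_1, ..., p_n)) to a set of target states
    eps   : set of epsilon transitions (p, q)
    """
    child_sets = [reachable_states(child, delta, eps) for child in term[1:]]
    states = set()
    for combo in product(*child_sets):
        states |= delta.get((term[0], combo), set())
    # Close the result under epsilon transitions.
    changed = True
    while changed:
        changed = False
        for p, q in eps:
            if p in states and q not in states:
                states.add(q)
                changed = True
    return states


def accepts(term, delta, eps, final):
    """A ground term is accepted if it reaches some final state."""
    return bool(reachable_states(term, delta, eps) & final)


# Hypothetical automaton: state "even"/"odd" records the parity of g's.
delta = {
    ("a", ()): {"even"},
    ("g", ("even",)): {"odd"},
    ("g", ("odd",)): {"even"},
}
```

With final state `even` and no epsilon transitions, the automaton accepts a and g(g(a)) but rejects g(a); adding the epsilon transition odd → even makes it accept every term.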

We encode n-tuples with n ≥ 1 of ground terms as terms over an enriched signature, as follows. We write F<sup>(n)</sup> for the signature (F ∪ {⊥})<sup>n</sup> where ⊥ ∉ F is a fresh constant. The arity of a symbol f<sub>1</sub> ··· f<sub>n</sub> ∈ F<sup>(n)</sup> is the maximum of the arities of f<sub>1</sub>,...,f<sub>n</sub>. The encoding of terms t<sub>1</sub>,...,t<sub>n</sub> ∈ T (F) is the unique term ⟨t<sub>1</sub>,...,t<sub>n</sub>⟩ ∈ T (F<sup>(n)</sup>) such that Pos(⟨t<sub>1</sub>,...,t<sub>n</sub>⟩) = Pos(t<sub>1</sub>) ∪ ··· ∪ Pos(t<sub>n</sub>) and ⟨t<sub>1</sub>,...,t<sub>n</sub>⟩(p) = f<sub>1</sub> ··· f<sub>n</sub> where f<sub>i</sub> = t<sub>i</sub>(p) if p ∈ Pos(t<sub>i</sub>) and f<sub>i</sub> = ⊥ otherwise, for all p ∈ Pos(⟨t<sub>1</sub>,...,t<sub>n</sub>⟩) and 1 ≤ i ≤ n. As an example, for the terms s = f(g(a), f(b, b)), t = g(g(a)), and u = f(b, g(a)) we obtain ⟨s, t, u⟩ = fgf(ggb(aa⊥), f⊥g(b⊥a, b⊥⊥)). An n-ary relation on ground terms is regular if its encoding is accepted by a tree automaton operating on terms in T (F<sup>(n)</sup>). Such an automaton is called an RR<sup>n</sup> automaton and regular n-ary relations are called RR<sup>n</sup> relations. The i-th cylindrification of an RR<sup>n</sup> relation R over T (F) is the RR<sup>n+1</sup> relation {(t<sub>1</sub>,...,t<sub>i−1</sub>, u, t<sub>i</sub>,...,t<sub>n</sub>) | (t<sub>1</sub>,...,t<sub>n</sub>) ∈ R and u ∈ T (F)}.
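The encoding of tuples can be computed by overlaying the component terms position by position. The following Python sketch reproduces the example above; the tuple-based term representation, the use of `None` for a missing subterm, and the names `encode` and `show` are assumptions made for illustration.

```python
BOT = "⊥"  # padding symbol for positions that a component does not have


def encode(terms):
    """Encode a list of ground terms as one term over the product signature.

    Each term is a tuple (symbol, child_1, ..., child_n); None stands for
    "this component has no subterm at the current position".  Symbols of the
    result are tuples of component symbols.
    """
    symbol = tuple(BOT if t is None else t[0] for t in terms)
    # The arity is the maximum arity of the component symbols.
    arity = max((len(t) - 1 for t in terms if t is not None), default=0)
    children = []
    for i in range(arity):
        column = [
            t[i + 1] if t is not None and i + 1 < len(t) else None
            for t in terms
        ]
        children.append(encode(column))
    return (symbol, *children)


def show(term):
    """Render an encoded term in the notation used in the text."""
    head = "".join(term[0])
    if len(term) == 1:
        return head
    return head + "(" + ", ".join(show(child) for child in term[1:]) + ")"


s = ("f", ("g", ("a",)), ("f", ("b",), ("b",)))   # f(g(a), f(b, b))
t = ("g", ("g", ("a",)))                          # g(g(a))
u = ("f", ("b",), ("g", ("a",)))                  # f(b, g(a))
print(show(encode([s, t, u])))  # fgf(ggb(aa⊥), f⊥g(b⊥a, b⊥⊥))
```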

Besides RR<sup>n</sup> automata, the decision procedure makes use of ground tree transducers (GTTs for short). A GTT is a pair G = (A, B) of tree automata over the same signature F. A pair (s, t) of ground terms in T (F) is accepted by G if s →<sup>∗</sup><sub>A</sub> u <sub>B</sub><sup>∗</sup>← t for some term u ∈ T (F ∪ Q). Here Q is the combined set of states of A and B. The set of all such pairs is denoted by L(G). We denote by L<sub>a</sub>(G) the set of all pairs (s, t) such that s →<sup>∗</sup><sub>A</sub> q <sub>B</sub><sup>∗</sup>← t for some state q ∈ Q. A binary relation R on ground terms is a(n anchored) GTT relation if there exists a GTT G such that R = L(G) (R = L<sub>a</sub>(G)). The decision procedure for the first-order theory of rewriting described in [7] and implemented in FORT uses GTTs; the formalized variant described in [14] uses anchored GTTs (aGTTs), which have better closure properties. Both are supported in our certificate language, but FORT-h and FORTify use anchored GTTs since they permit us to model more predicates while reducing the need for ad-hoc constructions that need to be turned into executable (verified) code.

The decision procedure for the first-order theory of rewriting constructs RR<sup>n</sup> automata for the subformulas in a bottom-up fashion. GTTs (aGTTs) come into play for some of the atomic subformulas consisting of predicate symbols and variables. Closure properties take care of the logical structure of formulas. A final emptiness check determines whether the formula is satisfied for the TRS given as input to the decision procedure. Rather than formally stating the properties involved, we illustrate the decision procedure on an example.

Example 1. Consider the formula ϕ = ∀ s ∃ t (s →<sup>∗</sup> t ∧ NF(t)), which expresses the normalization property of TRSs. To determine whether a TRS R over a signature F satisfies ϕ, we first construct an RR<sup>1</sup> automaton A<sup>1</sup> that accepts the ground normal forms in T (F), using an algorithm first described in [5] and recently formalized in [13]. For the subformula s →<sup>∗</sup> t we construct a GTT G<sup>1</sup> for the parallel rewrite relation induced by R. Since GTT relations are effectively closed under transitive closure (while RR<sup>2</sup> relations are not), we obtain a GTT G<sup>2</sup> for →<sup>∗</sup><sub>R</sub>. This GTT is transformed into an RR<sup>2</sup> automaton A<sup>2</sup>. (In the formalized decision procedure described in [14], an RR<sup>2</sup> automaton for →<sup>∗</sup> is constructed from an anchored GTT for the root step relation of R, using suitable closure properties of anchored GTT and RR<sup>2</sup> relations.) We cylindrify the RR<sup>1</sup> automaton A<sup>1</sup> into an RR<sup>2</sup> automaton A<sup>3</sup> that accepts T (F) × NF<sub>R</sub>. A product construction involving A<sup>2</sup> and A<sup>3</sup> produces an RR<sup>2</sup> automaton A<sup>4</sup> for the subformula s →<sup>∗</sup> t ∧ NF(t). Projection yields an RR<sup>1</sup> automaton A<sup>5</sup> corresponding to ∃ t (s →<sup>∗</sup> t ∧ NF(t)). So ϕ holds if and only if L(A<sup>5</sup>) = T (F). In FORT the ∀ quantifier is transformed into the equivalent ¬∃¬. Hence complementation is used to obtain an RR<sup>1</sup> automaton A<sup>6</sup> and the existential quantifier is implemented using projection. This gives an RR<sup>0</sup> automaton A<sup>7</sup> which accepts either the empty relation ∅ or the singleton set {()} consisting of the nullary tuple (). The outermost negation gives rise to another complementation step. The final RR<sup>0</sup> automaton A<sup>8</sup> is tested for emptiness: L(A<sup>8</sup>) = ∅ if and only if the TRS R does not satisfy ϕ.

#### **3 Formulas**

The first step in the certification process is to translate formulas in the first-order theory of rewriting into a format suitable for further processing. We adopt de Bruijn indices [4] to avoid alpha renaming.

Example 2. Consider the formula

forall s, t, u ([0] s ->\* t & [1] s ->\* u => exists v ([1] t ->\* v & [0] u ->\* v))

in FORT syntax. It expresses the commutation of two TRSs, indicated by the indices 0 and 1. Using de Bruijn indices for the term variables s, t, u, v produces

$$\forall \forall \forall \left(2 \to\_0^\* 1 \land 2 \to\_1^\* 0\right) \implies \exists \left(2 \to\_1^\* 0 \land 1 \to\_0^\* 0\right).$$

We refer to Example 4 for further explanation.
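The conversion to de Bruijn indices in Example 2 can be sketched in Python as follows. The tuple-based formula representation and the function name `to_de_bruijn` are assumptions made for illustration; they are not the data types used by FORT-h or FORTify.

```python
def to_de_bruijn(formula, bound=()):
    """Replace variable names by de Bruijn indices.

    `bound` lists the names of the enclosing binders, innermost first, so the
    index of a variable is its position in `bound`.
    """
    tag = formula[0]
    if tag in ("forall", "exists"):
        _, name, body = formula
        return (tag, to_de_bruijn(body, (name,) + bound))
    if tag == "rel":  # ("rel", trs_index, left_variable, right_variable)
        _, trs, left, right = formula
        return ("rel", trs, bound.index(left), bound.index(right))
    # Propositional connectives: convert all subformulas.
    return (tag, *(to_de_bruijn(sub, bound) for sub in formula[1:]))


# forall s, t, u ([0] s ->* t & [1] s ->* u => exists v ([1] t ->* v & [0] u ->* v))
commutation = (
    "forall", "s", ("forall", "t", ("forall", "u", (
        "implies",
        ("and", ("rel", 0, "s", "t"), ("rel", 1, "s", "u")),
        ("exists", "v", ("and", ("rel", 1, "t", "v"), ("rel", 0, "u", "v"))),
    ))))

converted = to_de_bruijn(commutation)
```

Under the three universal quantifiers s, t, u receive the indices 2, 1, 0. Below the existential quantifier v becomes 0 and the indices of t and u shift to 2 and 1, which matches the formula displayed above.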

The formal syntax of formulas in certificates is given below. Angle brackets are used for non-terminal symbols. Here rr<sup>2</sup> denotes the supported binary regular relations, which are formally defined after Example 3. Likewise, rr<sup>1</sup> stands for regular sets (which are identified with unary regular relations).

$$\begin{aligned}
\langle\mathit{formula}\rangle ::={} & \texttt{(rr1}\ \langle\mathit{rr}_1\rangle\ \langle\mathit{term}\rangle\texttt{)} \mid \texttt{(rr2}\ \langle\mathit{rr}_2\rangle\ \langle\mathit{term}\rangle\ \langle\mathit{term}\rangle\texttt{)} \\
\mid{} & \texttt{(and (}\langle\mathit{formula}\rangle^{\ast}\texttt{))} \mid \texttt{(or (}\langle\mathit{formula}\rangle^{\ast}\texttt{))} \mid \texttt{(not}\ \langle\mathit{formula}\rangle\texttt{)} \\
\mid{} & \texttt{(forall}\ \langle\mathit{formula}\rangle\texttt{)} \mid \texttt{(exists}\ \langle\mathit{formula}\rangle\texttt{)} \mid \texttt{(true)} \mid \texttt{(false)} \\
\langle\mathit{term}\rangle ::={} & \langle\mathit{nat}\rangle \qquad \langle\mathit{trs}\rangle ::= \langle\mathit{nat}\rangle \mid \langle\mathit{nat}\rangle\texttt{-}
\end{aligned}$$

De Bruijn indices are used for term variables, and ⟨nat⟩- denotes a TRS with index ⟨nat⟩ in which the left- and right-hand sides of the rules have been swapped. The class of linear variable-separated TRSs is closed under this operation. We use it to represent the conversion relation ↔<sup>∗</sup> of a TRS R as the reachability relation →<sup>∗</sup> induced by the TRS R ∪ R<sup>−</sup>.

Example 3. The commutation property in Example 2 is rendered as follows:

(forall (forall (forall (or (not (and (rr2 (step\* (0)) 2 1) (rr2 (step\* (1)) 2 0))) (exists (and (rr2 (step\* (1)) 2 0) (rr2 (step\* (0)) 1 0)))))))

Here (step\* (0)) denotes the RR<sup>2</sup> relation →<sup>∗</sup> induced by the first TRS (which is indexed by 0) and (rr2 (step\* (1)) 2 0) represents the subformula [1] t ->\* v of the FORT formula in Example 2.

We continue with the certificate syntax of RR<sup>1</sup> and RR<sup>2</sup> relations:

$$\begin{aligned}
\langle\mathit{rr}_1\rangle ::={} & \texttt{(terms)} \mid \texttt{(nf (}\langle\mathit{trs}\rangle^{+}\texttt{))} \mid \texttt{(inf}\ \langle\mathit{rr}_2\rangle\texttt{)} \mid \texttt{(proj}\ (\texttt{1} \mid \texttt{2})\ \langle\mathit{rr}_2\rangle\texttt{)} \\
\mid{} & \texttt{(union}\ \langle\mathit{rr}_1\rangle\ \langle\mathit{rr}_1\rangle\texttt{)} \mid \texttt{(inter}\ \langle\mathit{rr}_1\rangle\ \langle\mathit{rr}_1\rangle\texttt{)} \mid \texttt{(diff}\ \langle\mathit{rr}_1\rangle\ \langle\mathit{rr}_1\rangle\texttt{)} \\
\langle\mathit{rr}_2\rangle ::={} & \texttt{(gtt}\ \langle\mathit{gtt}\rangle\ \langle\mathit{pos}\rangle\ \langle\mathit{num}\rangle\texttt{)} \mid \texttt{(product}\ \langle\mathit{rr}_1\rangle\ \langle\mathit{rr}_1\rangle\texttt{)} \mid \texttt{(id}\ \langle\mathit{rr}_1\rangle\texttt{)} \\
\mid{} & \texttt{(union}\ \langle\mathit{rr}_2\rangle\ \langle\mathit{rr}_2\rangle\texttt{)} \mid \texttt{(inter}\ \langle\mathit{rr}_2\rangle\ \langle\mathit{rr}_2\rangle\texttt{)} \\
\mid{} & \texttt{(comp}\ \langle\mathit{rr}_2\rangle\ \langle\mathit{rr}_2\rangle\texttt{)} \mid \texttt{(inverse}\ \langle\mathit{rr}_2\rangle\texttt{)} \\
\langle\mathit{pos}\rangle ::={} & \texttt{>=} \mid \texttt{e} \mid \texttt{>} \qquad \langle\mathit{num}\rangle ::= \texttt{>=} \mid \texttt{1} \mid \texttt{>} \\
\langle\mathit{gtt}\rangle ::={} & \texttt{(root-step (}\langle\mathit{trs}\rangle^{+}\texttt{))} \mid \texttt{(union}\ \langle\mathit{gtt}\rangle\ \langle\mathit{gtt}\rangle\texttt{)} \mid \texttt{(inverse}\ \langle\mathit{gtt}\rangle\texttt{)} \\
\mid{} & \texttt{(acomp}\ \langle\mathit{gtt}\rangle\ \langle\mathit{gtt}\rangle\texttt{)} \mid \texttt{(gcomp}\ \langle\mathit{gtt}\rangle\ \langle\mathit{gtt}\rangle\texttt{)} \mid \texttt{(inter}\ \langle\mathit{gtt}\rangle\ \langle\mathit{gtt}\rangle\texttt{)} \\
\mid{} & \texttt{(acomplement}\ \langle\mathit{gtt}\rangle\texttt{)} \mid \texttt{(atc}\ \langle\mathit{gtt}\rangle\texttt{)} \mid \texttt{(gtc}\ \langle\mathit{gtt}\rangle\texttt{)}
\end{aligned}$$

Here (terms) refers to T (F), (nf (⟨trs⟩+)) to the normal forms (NF) induced by the union of the underlying TRSs, and (inf ⟨rr<sup>2</sup>⟩) to the infinity predicate (INF<sub>R</sub>) which is satisfied by all terms having infinitely many successors with respect to the relation R. Furthermore, (proj (1 | 2) ⟨rr<sup>2</sup>⟩) denotes projection (π) to the first (second) argument, (gtt ⟨gtt⟩ ⟨pos⟩ ⟨num⟩) the transformation of a GTT relation into an RR<sup>2</sup> relation with corresponding context closure (cf. [14, Section 3]), (id ⟨rr<sup>1</sup>⟩) the identity relation on the underlying set, and (gtc ⟨gtt⟩) ((atc ⟨gtt⟩)) the (anchored) transitive closure of the underlying (anchored) GTT relation.

The constructs defined above closely correspond to the formalized closure operations for the predicates in the first-order theory of rewriting, reported in [14] and summarized below:

$$\begin{array}{lcl}
A & ::= & \to_{\epsilon} \mid A^{-} \mid A \cup A \mid A^{+} \mid A^{\widehat{+}} \mid A \circ A \mid A \bigcirc A \mid A^{c} \mid A \cap A \\
R & ::= & A \mid R_{p}^{n} \mid R \cup R \mid R \cap R \mid R^{-} \mid T \times T \mid {=} \\
T & ::= & \mathcal{T}(\mathcal{F}) \mid \mathsf{NF} \mid \mathsf{INF}_{R} \mid T \cup T \mid T \cap T \mid T^{c} \mid \pi_{1}(R) \mid \pi_{2}(R) \\
n & ::= & {\geqslant} \mid 1 \mid {>} \qquad\quad p \ ::= \ {\geqslant} \mid \epsilon \mid {>}
\end{array}$$

Here A are anchored GTT relations (gtt), R are RR<sup>2</sup> relations (rr<sup>2</sup>), and T are regular sets of ground terms (rr<sup>1</sup>).

For convenience of tool authors, we add a few other constructs to rr<sup>2</sup>. The certifier expands these to a sequence of basic constructs given above.

$$\begin{aligned}
\langle\mathit{rr}_2\rangle ::={} & \cdots \mid \texttt{(step (}\langle\mathit{trs}\rangle^{+}\texttt{))} \mid \texttt{(step+ (}\langle\mathit{trs}\rangle^{+}\texttt{))} \mid \texttt{(step* (}\langle\mathit{trs}\rangle^{+}\texttt{))} \\
\mid{} & \texttt{(equality)} \mid \texttt{(parallel-step (}\langle\mathit{trs}\rangle^{+}\texttt{))} \mid \texttt{(root-step (}\langle\mathit{trs}\rangle^{+}\texttt{))} \\
\mid{} & \texttt{(non-root-step (}\langle\mathit{trs}\rangle^{+}\texttt{))} \mid \texttt{(join (}\langle\mathit{trs}\rangle^{+}\texttt{))}
\end{aligned}$$

The complete list can be obtained from the accompanying website.

#### **4 Certificates**

A certificate for a first-order formula ϕ explains how the corresponding RR<sup>n</sup> automaton is constructed. We adopt a line-oriented natural deduction style. The automata are implicit. This is a deliberate design decision to keep certificates small. More importantly, it avoids having to check equivalence of finite tree automata, which is EXPTIME-complete [6, Section 1.7].

```
⟨certificate⟩ ::= (⟨item⟩ ⟨inference⟩ ⟨formula⟩ ⟨info⟩*) ⟨certificate⟩
                | (empty ⟨item⟩) | (nonempty ⟨item⟩)
⟨item⟩        ::= ⟨nat⟩        ⟨info⟩ ::= (size ⟨nat⟩ ⟨nat⟩ ⟨nat⟩) | ···
⟨inference⟩   ::= (rr1 ⟨rr1⟩ ⟨term⟩) | (rr2 ⟨rr2⟩ ⟨term⟩ ⟨term⟩)
                | (and (⟨item⟩*)) | (or (⟨item⟩*)) | (not ⟨item⟩)
                | (exists ⟨item⟩) | (nnf ⟨item⟩) | ···
```

Currently the info field only serves as an interface between the tool (which provides the certificate) and the certifier to compare the sizes of the constructed automata. In the future we plan to extend this field with concrete automata. This makes it possible to test language equivalence of a tree automaton computed by a tool that supports our certificate language and the one reconstructed by FORTify, thereby providing tool authors with a mechanism to trace buggy constructions in case a certificate is rejected.

We revisit Example 1 to illustrate the construction of certificates.

Example 4. The formula ϕ = ∀ s ∃ t (s →<sup>∗</sup> t ∧ NF(t)) expressing normalization is rendered as ϕ = ∀∃(1 →<sup>∗</sup><sub>0</sub> 0 ∧ 0 ∈ NF[0]) in de Bruijn notation. Here 1 refers to the variable s, the second and third occurrences of 0 refer to t, and the first and last occurrences of 0 refer to the first (and only) TRS, which has index 0. We construct the certificate bottom-up, to mimic the decision procedure. The first line is for NF[0]:

(0 (rr1 (nf (0)) 0) (rr1 (nf (0)) 0))

The components can be read as follows:


The apparent redundancy will disappear when we continue. We proceed by expressing the relation →<sup>∗</sup><sub>0</sub> and subsequently make sure that the second component of →<sup>∗</sup><sub>0</sub> is in normal form:

```
(1 (rr2 (step* (0)) 1 0) (rr2 (step* (0)) 1 0))
(2 (and (1 0)) (and ((rr2 (step* (0)) 1 0) (rr1 (nf (0)) 0))))
```
Line 1 is similar to line 0. The inference step (and (1 0)) in line 2 constructs an RR<sup>2</sup> automaton that accepts the intersection of the relations modeled in lines 1 and 0. This automaton corresponds to A<sup>4</sup> in Example 1. The cylindrification step from A<sup>1</sup> to A<sup>3</sup> in Example 1 is left implicit. We continue with the projection of variable 0 and afterwards complement the resulting automaton. This is done by an exists followed by a not inference step:


The inference steps until this point describe the construction of A<sup>6</sup> in Example 1. We complete the certificate by introducing the remaining operators:


(nonempty 7)

The nnf inference step does not modify the tree automaton computed in step 6 (which corresponds to A<sup>8</sup> in Example 1) but checks the equivalence of the formula in line 6 with the one of line 7, which corresponds to the input formula ϕ. The equivalence check incorporates ∀ elimination, negation normal form, and associativity, commutativity and idempotency of ∧ and ∨. In the future we might add support for additional equivalences in first-order logic. The final step (nonempty 7) checks that L(A<sup>8</sup>) ≠ ∅. So this certificate claims that the input TRS is normalizing. For TRSs that do not satisfy ϕ, the final line in the certificate would be (empty 7).
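The bottom-up construction of this certificate can be sketched in Python. Lines 0 to 2 are those shown above. The shape of lines 3 to 7 is an assumption: it is reconstructed from the description of the inference steps (exists, not, exists, not, nnf) and from the analogous certificate in Example 6, and may differ in formatting from the actual output of FORT-h. The function name is hypothetical.

```python
def normalization_certificate(trs=0):
    """Return certificate lines for ∀∃(1 →* 0 ∧ 0 ∈ NF) as a list of strings."""
    nf = f"(rr1 (nf ({trs})) 0)"
    reach = f"(rr2 (step* ({trs})) 1 0)"
    lines = []

    def add(inference, formula):
        # Every line records its number, the inference step and the formula
        # that the automaton constructed so far represents.
        lines.append(f"({len(lines)} {inference} {formula})")
        return formula

    add(nf, nf)                                        # line 0: normal forms
    add(reach, reach)                                  # line 1: reachability
    f = add("(and (1 0))", f"(and ({reach} {nf}))")    # line 2: intersection
    f = add("(exists 2)", f"(exists {f})")             # line 3: projection
    f = add("(not 3)", f"(not {f})")                   # line 4: complement
    f = add("(exists 4)", f"(exists {f})")             # line 5: projection
    f = add("(not 5)", f"(not {f})")                   # line 6: complement
    # Line 7 restates line 6 as the input formula; the certifier checks the
    # equivalence of the two formulas.
    add("(nnf 6)", f"(forall (exists (and ({reach} {nf}))))")
    lines.append("(nonempty 7)")                       # claim: ϕ holds
    return lines


certificate = normalization_certificate()
```

Each formula is obtained by wrapping the previous one, which is why the inference column can refer to earlier lines by number only.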

In the previous example we intentionally skipped over some details to convey the underlying intuition. First of all, the rr<sup>2</sup> construct (step\* (0)) is derived and internally unfolded via (anchored) GTTs into

(gtt (gtc (root-step 0)) >= >)

Starting from an anchored GTT that accepts the root step relation induced by the first (and only) TRS in the list, the GTT transitive closure operation is applied, followed by a multi-hole context closure operation with at least one hole that may appear in any position. The result is an RR<sup>2</sup> automaton that accepts the relation →<sup>∗</sup><sub>0</sub>. We also mentioned that cylindrification is implicit. The same holds for the projection operation that is used in the exists inference steps. A projection takes place in the first component if the variable 0 is present in the list of variables, otherwise the inference step preserves the automaton. This approach is sound as variables indicate the relevant components of the RR<sup>n</sup> automaton. Thanks to the de Bruijn representation, the innermost quantifier refers to variable 0, the first component in the given RR<sup>n</sup> automaton. However, we must keep track of all variables occurring in the surrounding formula and update that list accordingly.
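The bookkeeping of variables in an exists inference step can be illustrated by a small Python sketch. It assumes that the free variables of a subformula are kept as a sorted list of de Bruijn indices and that the remaining indices are decremented when a quantifier is passed, as is usual for de Bruijn indices; the function name is hypothetical and the formalization may organize this differently.

```python
def exists_step(variables):
    """Model the effect of an exists inference step on the variable list.

    variables : sorted list of the free de Bruijn indices of the subformula
    returns   : (project, remaining) where `project` says whether the first
                component of the automaton is projected away and `remaining`
                lists the free indices of the quantified formula
    """
    project = 0 in variables
    remaining = [v - 1 for v in variables if v != 0]
    return project, remaining
```

For the subformula in line 2 of Example 4 the variable list is [0, 1], so the step projects the first component and leaves the list [0]. If the bound variable does not occur, as in the list [1, 2], the automaton is preserved and only the indices are shifted to [0, 1].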

#### **5 FORTify**

The example in the preceding section makes clear that a certificate can be viewed as a recipe for the certifier to perform certain operations on automata and formulas to confirm the final (non-)emptiness claim. In particular, checking a certificate is expensive because the decision procedure for the first-order theory is replayed using code-generated operations from a verified version of the decision procedure. In this section we describe the steps we performed to turn the Isabelle formalization of the decision procedure described in [14] into our certifier FORTify.

We use the FOL-Fitting library [3], which is part of the Archive of Formal Proofs,<sup>4</sup> to connect the first-order theory of rewriting and first-order logic. The translation is more or less straightforward. We interpret RR<sup>1</sup> constructions as predicates and RR<sup>2</sup> constructions as relations in first-order logic and prove both interpretations to be semantically equivalent:

**lemma** eval formula F Rs α f = eval α undefined (for eval rel F Rs) (form of formula f )

With this equivalence we are able to define the semantics of formulas:

**definition** formula satisfiable **where** formula satisfiable F Rs f ←→ (∃ α. range α ⊆ T <sup>G</sup> F ∧ eval formula F Rs α f )

**definition** formula unsatisfiable **where** formula unsatisfiable F Rs fm ←→ (formula satisfiable F Rs fm = False)

**definition** correct certificate **where** correct certificate F Rs claim infs n ≡ (claim = Empty ←→ (formula unsatisfiable (fset F) (map fset Rs) (fst (snd (snd (infs ! n))))) ∧ claim = Nonempty ←→ formula satisfiable (fset F) (map fset Rs) (fst (snd (snd (infs ! n)))))

Last but not least we define the important function check certificate which takes as input a signature, a list of TRSs, a boolean, a formula, and a certificate. This function first verifies that the given formula and the claim correspond to the ones referenced in the certificate and afterwards checks the integrity of the certificate. The following lemmata, which are formally proved in Isabelle, state the correctness of the check certificate function:

**lemma** check certificate F Rs A fm (Certificate infs claim n) = Some B =⇒ fm = fst (snd (snd (infs ! n))) ∧ A = (claim = Nonempty)

**lemma** check certificate F Rs A fm (Certificate infs claim n) = Some B =⇒ (B = True −→ correct certificate F Rs claim infs n) ∧ (B = False −→ correct certificate F Rs (case claim of Empty ⇒ Nonempty | Nonempty ⇒ Empty) infs n)

<sup>4</sup> https://www.isa-afp.org

The first lemma ensures that our check function verifies that the provided parameters fm (formula) and A (answer satisfiable/unsatisfiable) match the formula and the claim stated in the certificate. The second lemma is the key result. It states that the check function returns Some True if and only if the certificate is correct. The only-if case is hidden in the last two lines. More precisely, if the claim of the certificate is wrong then negating the claim (the first-order theory of rewriting is complete) leads to a correct certificate. Therefore, if our check function returns Some False then the certificate is correct after negating the claim.

Our check function returns None if the global assumptions (the input TRSs are linear variable-separated, the signature is not empty, etc.) are not fulfilled. We plan to extend the check certificate function in the near future such that it reports these kinds of errors.
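The three possible outcomes of the check function can be summarized by the following Python sketch. It models Some True, Some False and None as `True`, `False` and `None`; the function name and the string encoding of claims are assumptions for illustration and do not correspond to the generated Haskell code.

```python
NEGATED = {"Empty": "Nonempty", "Nonempty": "Empty"}


def established_claim(result, claim):
    """Return the claim that is established by the outcome of the check.

    result : True  (certificate correct as stated),
             False (certificate correct after negating its claim), or
             None  (global assumptions violated, nothing is established)
    claim  : "Empty" or "Nonempty", the claim stated in the certificate
    """
    if result is None:
        return None
    return claim if result else NEGATED[claim]
```

For a certificate ending in (nonempty 7), the outcome `False` therefore means that the accepted language is in fact empty.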

A central part of the formalization is to obtain a trustworthy decision procedure to verify certificates. Hence we use the code generation facility of Isabelle/HOL to produce an executable version of our check certificate function. Isabelle's code generation facility is able to derive executable code for our constructions with the exception of inductively defined sets. In [8, Section 7] an abstract Horn inference system for finite sets is introduced to overcome this limitation. We use this framework to obtain executable code for the following constructions defined as inductive sets:


At this point we can use Isabelle's code generation to obtain an executable check function. However, more effort is needed to obtain an efficient check function. Checking the certificate in Example 6 below did not terminate after more than 24 hours of computation time. We used the profiling capabilities of the Glasgow Haskell Compiler (GHC) to analyze the generated code. This revealed that most of the time was spent on checking membership. Since the computed tree automata can grow very large, the use of lists as underlying data structure for sets in the generated code is a bottleneck.

To overcome this problem we decided to use the container framework of Lochbihler [12]. In our case, the setup involved a non-trivial overhead as the container framework requires multiple class instances for data types used inside sets. Some of these instances could be derived automatically by the deriving framework of Sternagel and Thiemann [20]. Afterwards Isabelle's code generation was able to generate a check certificate function that uses red-black trees as underlying data structure for sets.

Sadly, the function was still infeasible for the certificate in Example 6. This time the power set construction, which is exponential in the worst case, turned out to be the culprit. In this construction we compute the transitive closure of the present epsilon transitions multiple times. Adding an explicit construction to remove epsilon transitions from tree automata solved this issue. To make a long story short, after further modifications we were able to verify the certificate for Example 6 in a little less than 3 minutes, which we consider fast enough for a first prototype. The resulting code-generated certifier is called FORTify.

**Fig. 1.** Certificate validation with FORTify.
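The idea of removing epsilon transitions once, before an expensive construction such as the power set construction, can be sketched in Python as follows. This is a generic textbook-style construction under an assumed representation of automata (sets of tuples); it is not the verified construction of the formalization.

```python
def remove_epsilon(delta, eps, final):
    """Return an equivalent automaton without epsilon transitions.

    delta : set of rules (symbol, (p_1, ..., p_n), q)
    eps   : set of epsilon transitions (p, q)
    final : set of final states
    """
    states = {s for (_, args, q) in delta for s in (*args, q)}
    states |= {s for pair in eps for s in pair}
    # closure[s] = states reachable from s by zero or more epsilon steps,
    # computed once as a fixpoint.
    closure = {s: {s} for s in states}
    changed = True
    while changed:
        changed = False
        for p, q in eps:
            for s in states:
                if p in closure[s] and not closure[q] <= closure[s]:
                    closure[s] |= closure[q]
                    changed = True
    # Redirect every rule to all states reachable from its target.
    new_delta = {(f, args, r) for (f, args, q) in delta for r in closure[q]}
    return new_delta, final


# Hypothetical automaton: a -> p, p -> q, q -> r (epsilon), g(r) -> r.
example_delta = {("a", (), "p"), ("g", ("r",), "r")}
example_eps = {("p", "q"), ("q", "r")}
new_delta, new_final = remove_epsilon(example_delta, example_eps, {"r"})
```

Since the rules are saturated with respect to the epsilon closure, every term reaches the same set of states as before, so the set of final states can be kept unchanged.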

The overall design of FORTify is shown in Figure 1. It can be viewed as two separate modules A and B. Module B is the verified Haskell code base that is generated by Isabelle's code generation facility, containing the check certificate function and the data type declarations for formulas and certificates. To use this functionality, we wrote a parser which translates strings representing formulas (signatures, TRSs, certificates) to semantically equivalent formulas (signatures, TRSs, certificates) represented in the data types obtained from the generated code. This was done in Haskell and corresponds to module A in Figure 1. Module A accepts formulas in FORT syntax. Hence it also applies the conversion to the de Bruijn representation. After the translation in module A, the check certificate function in module B is executed and its output is reported.

Importantly, the code in module A is not verified in Isabelle. Correctness of FORTify must therefore assume correctness of module A as well as the correctness of the Glasgow Haskell Compiler, which we use to generate a standalone executable from the generated code.

#### **6 FORT-h**

FORT-h is a new decision tool for the first-order theory of rewriting. It is a reimplementation of the decision mode of the previous FORT tool [18] based on a modified decision procedure. The decision procedure, like the formalization, is based on anchored GTTs. The new tool is implemented in Haskell whereas FORT is written in Java.

FORT-h supports all features of FORT while extending the domain of supported TRSs from left-linear right-ground TRSs to linear variable-separated ones. While FORT could technically take such TRSs as input, it is unsound when checking non-ground properties on them.

**Fig. 2.** Interface of FORT-h.

Example 5. To check confluence of the linear variable-separated TRS

$$\mathbf{g(g(x)) \to g(y)}\qquad\qquad\qquad\qquad\qquad\qquad\mathbf{a \to g(a)}$$

FORT-h can be called with

    > ./fort-h "CR" input.trs
    NO

where input.trs is a text file containing the rewrite system. The tool correctly reports NO: the system is not confluent. However, FORT incorrectly identifies this system as confluent due to the lack of support for variables appearing in right-hand sides of rules.

FORT-h took part in the 2020 edition of the Confluence Competition, competing in five categories: COM, GCR, NFP, UNC and UNR. Even though it does not support many problems tested in the competition, due to the restriction to linear variable-separated TRSs, it was able to win the category for most YES results in UNR. The tool expects as input a formula ϕ and one or more TRSs, as seen in Figure 2. It then outputs the answer YES or NO depending on whether ϕ is satisfied or not by the given TRSs. FORT-h may be passed some additional options:




As an example of the latter, consider Example 5 and the call

```
> fort-h -w "CR" input.trs
NO
formula body / witness:
    (0 (<- o ->*) 1 & ~ 0 (->* o *<-) 1)
    0 = g(_00())
    1 = g(_01())
```
So in addition to the answer NO, it also outputs a counterexample for the given formula consisting of the two terms g(\_00()) and g(\_01()). Here \_00 and \_01 are additional constants required to reduce confluence to ground-confluence, and represent variables. The terms should therefore be read as g(x) and g(y).

Internally FORT-h represents formulas using de Bruijn indices as described in Section 3. Additionally, universal quantifiers and implications are eliminated, and negations are pushed as far as possible to the atomic subformulas. The tool then traverses the formula in a bottom-up fashion, constructing the corresponding anchored GTTs and RR<sup>n</sup> automata. During this traversal we also keep track of the steps taken, to construct the certificate if necessary. To improve performance the automata are cached and reused for equal subformulas. The tree automaton representing the whole formula is then checked for emptiness. If the accepted language is empty, FORT-h reports NO, otherwise it outputs YES.
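The preprocessing of formulas described above can be sketched in Python. The sketch eliminates implications and universal quantifiers (via ∀ϕ ≡ ¬∃¬ϕ) and pushes negations inward until they reach an atom or an existential quantifier. The tuple-based formula representation and the function name are assumptions for illustration; FORT-h itself is written in Haskell and its internal data types are not shown in this paper.

```python
def simplify(formula, negate=False):
    """Eliminate forall/implies and push negations towards the atoms.

    `negate` records whether an odd number of negations is pending.
    """
    tag = formula[0]
    if tag == "not":
        return simplify(formula[1], not negate)
    if tag == "implies":
        return simplify(("or", ("not", formula[1]), formula[2]), negate)
    if tag in ("and", "or"):
        dual = {"and": "or", "or": "and"}
        new_tag = dual[tag] if negate else tag          # De Morgan
        return (new_tag, *(simplify(sub, negate) for sub in formula[1:]))
    if tag == "forall":
        # ∀ϕ ≡ ¬∃¬ϕ; a pending negation cancels the outer one.
        inner = ("exists", simplify(formula[1], True))
        return inner if negate else ("not", inner)
    if tag == "exists":
        # A negation cannot be pushed through ∃ without reintroducing ∀.
        inner = ("exists", simplify(formula[1], False))
        return ("not", inner) if negate else inner
    return ("not", formula) if negate else formula      # atomic formula


# Shape of a confluence-like property: ∀∀ (P ⇒ Q) with atoms P and Q.
example = ("forall", ("forall", ("implies", ("P",), ("Q",))))
simplified = simplify(example)
```

The result ¬∃∃(P ∧ ¬Q) has the same shape as line 6 of the certificate in Example 6, and its body P ∧ ¬Q corresponds to the formula body printed together with a witness.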

#### **7 Experiments**

The experiments described in this section were run on a computer with an Intel(R) Core(TM) i7-5930K CPU with 6 cores at 3.50GHz.

In the 2019 edition of the Confluence Competition [15] three tools contested the commutation (COM) category:<sup>5</sup> ACP [1], CoLL [19], and FORT. On input problem COPS #1118 the tools gave conflicting answers.

Example 6. COPS #1118 is about the commutation of the TRSs COPS #669

$$a \to c \qquad f(a) \to b \qquad b \to b \qquad b \to h(b, h(c, a))$$

and COPS #695

$$h(a, a) \to c \qquad b \to h(b, a) \qquad b \to a \qquad f(c) \to c \qquad c \to a$$

To determine the correct answer we use FORT-h to produce a certificate for ground-commutation by calling

```
> fort-h -c cert -i "GCom([0],[1])" 1118.trs
YES
```
This produces the following certificate:

```
(0 (rr2 (comp (inverse (step* (1))) (step* (0))) 0 1)
   (rr2 (comp (inverse (step* (1))) (step* (0))) 0 1)
   (size 13 53 0))
(1 (rr2 (comp (step* (0)) (inverse (step* (1)))) 0 1)
   (rr2 (comp (step* (0)) (inverse (step* (1)))) 0 1)
   (size 11 47 0))
(2 (not 1) (not (rr2 (comp (step* (0)) (inverse (step* (1)))) 0 1)))
(3 (and (0 2))
   (and ((rr2 (comp (inverse (step* (1))) (step* (0))) 0 1)
      (not (rr2 (comp (step* (0)) (inverse (step* (1)))) 0 1)))))
(4 (exists 3)
   (exists (and ((rr2 (comp (inverse (step* (1))) (step* (0))) 0 1)
      (not (rr2 (comp (step* (0)) (inverse (step* (1)))) 0 1))))))
(5 (exists 4)
   (exists (exists (and ((rr2 (comp (inverse (step* (1)))
      (step* (0))) 0 1) (not (rr2 (comp (step* (0))
      (inverse (step* (1)))) 0 1)))))))
(6 (not 5)
   (not (exists (exists (and (
      (rr2 (comp (inverse (step* (1))) (step* (0))) 0 1)
      (not (rr2 (comp (step* (0)) (inverse (step* (1)))) 0 1))))))))
(7 (nnf 6)
   (forall (forall (or (
      (not (rr2 (comp (inverse (step* (1))) (step* (0))) 0 1))
      (rr2 (comp (step* (0)) (inverse (step* (1)))) 0 1))))))
(nonempty 7)
```
When passing this certificate to FORTify, after 2 minutes and 57 seconds the output Certified is produced, so we can be assured that the TRSs do commute. Note that the inference steps 0 and 1 contain the optional size information. Here (size k m n) means the underlying RR<sup>n</sup> automaton constructed by FORT-h contains k final states, m transitions, and n epsilon transitions.

<sup>5</sup> https://cops.uibk.ac.at/results/?y=2019&c=COM

**Table 1.** FORT(-h) run on GCR formulas with a 60 s timeout (FORTify with 600 s).

We also ran some experiments comparing FORT-h to FORT. The problems for these experiments are taken from the Confluence Problems database (COPS), and consist of 122 left-linear right-ground TRSs. Note that FORT-h implements no parallelism, while FORT does. For the first two experiments we chose a timeout of 60 seconds for the decision tools and 600 seconds for FORTify. The formulas were taken from the experiments reported in [17]. The first three

$$\forall s \forall t \forall u \left(s \to^* t \land s \to^* u \implies t \downarrow u\right) \tag{1}$$

$$\forall s \forall t \forall u \left(s \to^* t \land s \to u \implies t \downarrow u\right) \tag{2}$$

$$\forall t \forall u \left( t \leftrightarrow^* u \implies t \downarrow u \right) \tag{3}$$

denote different but equivalent formulations of ground-confluence (GCR).

The results are shown in Table 1, where the YES (NO) column shows the number of systems determined to be (non-)ground-confluent together with the average time (∅-time) the tool took. The ∞ column is the number of timeouts. To compare overall performance the total time column contains the sum of all runtimes, including timeouts but excluding the time taken by FORTify. The ✔ columns show the numbers of certifiable results as well as the overall time taken by FORTify (✔-time). These results show that, even though they have the same meaning, the choice of formula has an impact on performance. Interestingly, FORT-h is generally faster and can solve more problems than FORT even though it cannot take advantage of any parallelism. This performance advantage is more prominent in systems which are non-confluent. For problems with the answer YES, FORT can still prove more. The table also shows that FORTify can only certify a small portion of the results. This is due to the performance of the certifier, since all other problems time out. It is also apparent that formulas containing conversion (↔<sup>∗</sup>) are especially slow. No wrong results by the decision tools were identified.

The second set of formulas represents the normal form property, restricted to ground terms (GNFP):

$$\forall t \forall u \left( t \leftrightarrow^* u \land \mathsf{NF}(u) \implies t \to^* u \right) \tag{4}$$

$$\forall s \forall t \forall u \left(s \to t \land s \to^! u \implies t \to^* u\right) \tag{5}$$

$$\forall t \left( \mathsf{WN}(t) \implies \mathsf{CR}(t) \right) \tag{6}$$

The results for these are shown in Table 2. The same pattern is observed, where even though both can (dis)prove satisfaction for the same formulas, FORT-h is faster overall.

For the last experiment we test performance on properties over two TRSs. This is done by checking ground-commutation (GCOM) for all pairs of systems from the dataset, resulting in 7503 problems. A timeout of 60 seconds was used. The results, presented in Table 3, show that FORT-h is ahead here as well, (dis)proving more problems and doing so in significantly less time.
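The text does not say how the pairs are formed. The number 7503 is consistent with taking all unordered pairs of the 122 systems, including the pairing of each system with itself; this reading is our assumption. The arithmetic can be checked as follows:

```python
from math import comb

systems = 122
# Unordered pairs of distinct systems, plus each system paired with itself.
problems = comb(systems, 2) + systems
print(problems)
```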

Full details of the experiments are available from the website<sup>6</sup> accompanying this paper. Precompiled binaries of FORT-h and FORTify are available from the same site. We also present a few additional experiments with FORTify.

<sup>6</sup> https://fortissimo.uibk.ac.at/tacas2021





#### **8 Conclusion**

In this paper we presented FORTify, a certifier for the first-order theory of rewriting for linear variable-separated TRSs, together with an expressive certificate language for formulas and proofs. Moreover, a new implementation of the decision procedure for the theory of rewriting, FORT-h, is capable of producing certificates in this language.

We mention three topics which require further research. First of all, many certificates produced by FORT-h cannot be validated by the current version of FORTify within reasonable time. We will further improve the algorithms and data structures used in the check-certificate function. A natural candidate for optimization is the transitive closure algorithm generated by Isabelle, which always takes cubic time. Currently, sharing only takes place in the inference rules. Expanding this to the individual constructions will be the next step. Also trimming of anchored GTTs could improve the run time. In the current state of the formalization only trimming of GTTs is proved to be sound. Profiling will be used to determine other candidates that are likely to have a large impact on the validation time.

A second topic for future research is the certification of properties on open (i.e., non-ground) terms. In [8,16,18] conditions are presented to reduce properties related to confluence to the corresponding properties on ground terms, by adding additional constants to the signature. These results need to be formalized in Isabelle and the certificate language needs to be extended, before FORTify can be used to certify the corresponding categories in the Confluence Competition. We plan to define signature extensions directly in formulas, to offer the most flexibility. A related issue is the support for many-sorted signatures in the Isabelle formalization. FORT-h already supports many-sorted TRSs, which is the format in the GCR category of CoCo.

A third topic is improving the efficiency of FORT-h. We anticipate that supporting parallelism will further speed up FORT-h, especially for large formulas. Preprocessing techniques that go beyond the mere transformation to negation normal form will be helpful to obtain equivalent formulas that reduce the size of the ensuing tree automata in the decision procedure. In [10] similar ideas are applied to WSkS, in connection with MONA [11].

Acknowledgments. We thank René Thiemann for giving valuable advice on how to improve the efficiency of the generated code. The comments by the anonymous reviewers improved the presentation.

#### **References**



#### **Syntax-Guided Quantifier Instantiation**<sup>⋆</sup>

Aina Niemetz<sup>1</sup>, Mathias Preiner<sup>1(✉)</sup>, Andrew Reynolds<sup>2</sup>, Clark Barrett<sup>1</sup>, and Cesare Tinelli<sup>2</sup>

<sup>1</sup> Stanford University, Stanford, USA preiner@cs.stanford.edu <sup>2</sup> The University of Iowa, Iowa City, USA

**Abstract.** This paper presents a novel approach for quantifier instantiation in Satisfiability Modulo Theories (SMT) that leverages syntax-guided synthesis (SyGuS) to choose instantiation terms. It targets quantified constraints over background theories such as (non)linear integer, real, and floating-point arithmetic, bit-vectors, and their combinations. Unlike previous approaches for quantifier instantiation in these domains, which rely on theory-specific strategies, the new approach can be applied to any (combined) theory, when provided with a grammar for instantiation terms for all sorts in the theory. We implement syntax-guided instantiation in the SMT solver CVC4, leveraging its support for enumerative SyGuS. Our experiments demonstrate the versatility of the approach, showing that it is competitive with or exceeds the performance of state-of-the-art solvers on a range of background theories.

#### **1 Introduction**

Modern Satisfiability Modulo Theories (SMT) solvers are highly efficient tools, capable of reasoning about constraints over a wide range of logical theories, including (non-linear) real and integer arithmetic, fixed-size bit-vectors, and floating-point arithmetic. Their core algorithms are designed primarily for quantifier-free constraints, but various extensions have been shown to work well also for quantified constraints in many cases. Quantified reasoning in SMT has many practical applications, including software verification, automated theorem proving, and synthesis.

Current SMT solvers handle quantified constraints in a variety of ways, with a degree of effectiveness that usually depends on the background theory. For instance heuristic instantiation techniques such as E-matching [15] are used for quantified formulas with heavy use of uninterpreted functions. These heuristic instantiation techniques are refutationally incomplete but they can be highly effective, in particular in the context of verification applications. For quantified constraints over a particular background theory, such as linear arithmetic or fixed-size bit-vectors, on the other hand, SMT solvers resort to an entirely different set of techniques. While also based on quantifier instantiation, these other techniques tend to be counterexample-guided and can be complete for theories and fragments of first-order logic that admit quantifier elimination.

<sup>⋆</sup> This work was supported in part by DARPA (award no. FA8650-18-2-7861), NSF (award no. 1656926) and ONR (award no. N68335-17-C-0558).

© The Author(s) 2021

J. F. Groote and K. G. Larsen (Eds.): TACAS 2021, LNCS 12652, pp. 145–163, 2021. https://doi.org/10.1007/978-3-030-72013-1_8

Specific previous work in the latter direction includes counterexample-guided quantifier instantiation techniques for linear arithmetic [25] and fixed-size bit-vectors [18,20]. The key to developing each of them is to devise an appropriate, theory-specific selection function, which determines a term selection strategy for instantiating universal quantifiers. For some logics, e.g., linear arithmetic, selection functions can be based on the notion of elimination set found in classic algorithms for quantifier elimination [9, 14]. However, since many theories used in practice do not admit quantifier elimination, the design of a good selection function is usually non-trivial. These challenges are further magnified when reasoning in combinations of multiple theories.

We propose a novel, syntax-guided quantifier instantiation (SyQI) approach, which is both general-purpose and highly effective for quantified formulas in background theories such as (non)linear integer, real, and floating-point arithmetic, and their combinations. The new approach leverages an embedding of a solver for the syntax-guided synthesis (SyGuS) problem [1] within an SMT solver in order to choose terms for quantifier instantiation in a counterexample-guided manner. It is theory-agnostic and only requires the specification, via a grammar, of the set of terms to consider for each sort in the theory when instantiating quantifiers.<sup>3</sup> Since it can be applied to quantified formulas in any background theory, it is more general in scope than previous work [20]. Our approach is intended for logics such as quantified floating-point arithmetic, which would benefit from counterexample-guided quantifier instantiation, but for which appropriate selection functions are not obvious. We show that the use of syntax-guided synthesis gives us the flexibility to develop variants of our approach that are highly competitive with the state of the art in SMT solving. More specifically, this paper makes the following contributions:


<sup>3</sup> Our implementation provides a default grammar for all supported sorts. In general, grammars can also be provided by the user. We do not explore this option here.

Related Work. Handling quantified formulas in SMT solvers is a long-standing challenge. Early approaches for quantified formulas were largely based on E-matching [8, 10, 15]. They have been later supplemented with techniques that rely on models for establishing satisfiability [11, 26], and on conflict finding to accelerate the search for unsatisfiability [27]. Pragmatic enumerative approaches for quantifier instantiation have also been explored and shown to increase the precision of SMT solvers on inputs involving uninterpreted functions where E-matching is incomplete [21]. The approach we describe here is also enumerative in nature; however, it leverages syntax-guided synthesis for choosing instantiations and does not target inputs with uninterpreted functions.

For quantified formulas over a single background theory, counterexample-guided approaches have been considered by Bjørner and Janota [6] and by Reynolds et al. [25], targeting primarily quantified linear integer/real arithmetic. For theories of other data types (e.g., bit-vectors), most approaches use value-based instantiation, where concrete variable assignments for a set of quantifier-free formulas derived from the negation of the input formula (the counterexamples) provide instantiations for the universal variables. In the SMT solver Z3 [16], model-based quantifier instantiation (MBQI) [11] is combined with a template-based model finding procedure [29]. A recent line of work by Niemetz et al. [18] leverages invertibility conditions in a counterexample-guided loop for quantifier instantiation of formulas in the theory of fixed-size bit-vectors. Brain et al. [7] lift the concept of invertibility conditions to the theory of floating-point arithmetic and present a preliminary quantifier elimination procedure for a fragment of the theory based on these conditions. Another approach for lazy quantifier elimination for bit-vector formulas is explored by Vediramana Krishnan et al. [12], based on iterative approximate quantifier elimination.

Reynolds et al. [24] leverage counterexample-guided quantifier instantiation (CEGQI) to efficiently solve a restricted but practically useful form of syntax-guided synthesis problems. In contrast, the work in this paper has the dual goal of leveraging enumerative syntax-guided synthesis to establish a strategy for quantifier instantiation of (first-order) quantified formulas.

SyGuS techniques to solve quantified problems were previously explored by Preiner et al. in [20]. However, instead of focusing on quantifier instantiation they combined enumerative syntax-guided synthesis with value-based quantifier instantiation to synthesize Skolem functions for existential variables.

#### **2 Background**

We assume the usual notions and terminology of many-sorted first-order logic with equality (denoted by ≈). Let S be a set of sort symbols. For every σ ∈ S, let X<sub>σ</sub> be an infinite set of variables of sort σ. Let X = ⋃<sub>σ∈S</sub> X<sub>σ</sub>. Let Σ be a signature consisting of a set Σ<sup>s</sup> ⊆ S of sort symbols and a set Σ<sup>f</sup> of interpreted (and sorted) function symbols f<sup>σ<sub>1</sub>···σ<sub>n</sub>σ</sup> with arity n ≥ 0 and σ<sub>1</sub>, ..., σ<sub>n</sub>, σ ∈ Σ<sup>s</sup>. We assume that Σ includes a Boolean sort Bool and the Boolean values ⊤ (true) and ⊥ (false). Let I be a Σ-interpretation that maps: each sort σ ∈ Σ<sup>s</sup> to a nonempty set σ<sup>I</sup> (the domain of I), with Bool<sup>I</sup> = {⊤, ⊥}; each variable x ∈ X<sub>σ</sub> to an element x<sup>I</sup> ∈ σ<sup>I</sup>; and each function f<sup>σ<sub>1</sub>···σ<sub>n</sub>σ</sup> ∈ Σ<sup>f</sup> to a total function f<sup>I</sup>: σ<sub>1</sub><sup>I</sup> × ... × σ<sub>n</sub><sup>I</sup> → σ<sup>I</sup> if n > 0, and to an element in σ<sup>I</sup> if n = 0.

We assume the usual definition of well-sorted terms, literals, and formulas as Bool terms with variables in X and symbols in Σ, and refer to them as Σ-terms, Σ-atoms, and so on. A ground term/formula is a Σ-term/formula without variables. We define *x* = (x<sub>1</sub>, ..., x<sub>n</sub>) as a tuple of variables and write Q*x*.ϕ with Q ∈ {∀, ∃} for a quantified formula Qx<sub>1</sub>. ··· Qx<sub>n</sub>.ϕ. A formula is universal if it has the form ∀*x*. P where P is a quantifier-free formula. For simplicity, we consider only universal quantifiers since existential quantifiers can be rewritten in terms of universal ones. We use Lit(ϕ) to denote the set of Σ-literals of Σ-formula ϕ. For a Σ-term or Σ-formula e, we use e[*x*] to indicate that the free variables of e are in *x*. For a tuple of Σ-terms *t* = (t<sub>1</sub>, ..., t<sub>n</sub>), we write e[*t*] for the term or formula obtained from e by simultaneously replacing each occurrence of x<sub>i</sub> in e by t<sub>i</sub>. If t is a Σ-term/formula and I a Σ-interpretation, we write t<sup>I</sup> to denote the meaning of t in I. We use the usual inductive definition of a satisfiability relation |= between Σ-interpretations and Σ-formulas.

A theory T is a pair (Σ, **I**), where Σ is a signature and **I** is a non-empty class of Σ-interpretations (the models of T) that is closed under variable reassignment, i.e., every Σ-interpretation that only differs from an I ∈ **I** in how it interprets variables is also in **I**. A Σ-formula ϕ is T-satisfiable (resp. T-unsatisfiable) if it is satisfied by some (resp. no) interpretation in **I**; it is T-valid if it is satisfied by all interpretations in **I**.

Enumerative SyGuS using an Embedding into Datatypes. A syntax-guided synthesis problem for an n-ary function f in a background theory T consists of a set of semantic restrictions (a specification) for f, given as a (second-order) T-formula of the form ∃f.ϕ[f], and a set of syntactic restrictions on the solutions for f, typically expressed as a context-free grammar. A solution to such a problem is a term t[x1,...,xn] that satisfies the syntactic restrictions and is such that the formula ϕ[λx1,...,xn.t] is T-valid.
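As an illustration of this definition, the following sketch solves a toy SyGuS problem for a binary function: the specification asks for the maximum of two integers, and the syntactic restriction is the grammar Z ::= x | y | ite(Z ≥ Z, Z, Z). Both the problem and all names in the code are our own assumptions. T-validity is only approximated by testing on a small range of integers, which a real solver would not do.

```python
from itertools import product


def enumerate_terms(depth):
    """Yield (text, function) pairs for Z ::= x | y | ite(Z >= Z, Z, Z)."""
    leaves = [("x", lambda x, y: x), ("y", lambda x, y: y)]
    yield from leaves
    if depth == 0:
        return
    smaller = list(enumerate_terms(depth - 1))
    for (cl, fl), (cr, fr), (tt, ft), (et, fe) in product(smaller, repeat=4):
        text = f"ite({cl} >= {cr}, {tt}, {et})"
        # Bind the sub-functions as defaults so each lambda keeps its own.
        yield text, (lambda x, y, fl=fl, fr=fr, ft=ft, fe=fe:
                     ft(x, y) if fl(x, y) >= fr(x, y) else fe(x, y))


def spec(f, x, y):
    """Semantic restriction: f returns the maximum of its two arguments."""
    r = f(x, y)
    return r >= x and r >= y and (r == x or r == y)


def synthesize(max_depth=1, bound=3):
    """Return the first enumerated term meeting the specification on the test range."""
    domain = range(-bound, bound + 1)
    for text, f in enumerate_terms(max_depth):
        if all(spec(f, x, y) for x, y in product(domain, repeat=2)):
            return text
    return None


print(synthesize())
```

The enumeration order is fixed by `itertools.product`, so the first term that passes the check here is `ite(x >= y, x, y)`.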

As shown in previous work [24], syntactic restrictions for the bodies of functions to synthesize can be conveniently represented as a set of (algebraic) datatypes. The setting in this paper is simpler. Instead of synthesizing terms corresponding to function bodies, we use context-free grammars for defining a set of (first-order) terms in a given theory, possibly containing free function symbols. For instance, let a and b be free constants of sort Int. The context-free grammar R below specifies a set of integer (Z) and Boolean (B) terms:

$$Z ::= 0 \quad | \quad 1 \quad | \quad a \quad | \quad b \quad | \quad Z + Z \quad | \quad Z - Z \quad | \quad \text{ite}(B, Z, Z) \tag{1}$$

$$B ::= Z \ge Z \quad | \quad Z \approx Z \quad | \quad \neg B \quad | \quad B \land B \tag{2}$$

Given such a grammar, our SyGuS solver generates the following mutually recursive datatypes:

$$\mathcal{Z} = \text{zero} \mid \text{one} \mid \text{a} \mid \text{b} \mid \text{plus}(\mathcal{Z}, \mathcal{Z}) \mid \text{minus}(\mathcal{Z}, \mathcal{Z}) \mid \text{ite}(\mathcal{B}, \mathcal{Z}, \mathcal{Z}) \tag{3}$$

$$\mathcal{B} = \text{geq}(\mathcal{Z}, \mathcal{Z}) \mid \text{eq}(\mathcal{Z}, \mathcal{Z}) \mid \text{not}(\mathcal{B}) \mid \text{and}(\mathcal{B}, \mathcal{B}) \tag{4}$$

Each datatype constructor, listed on the right-hand side of each equation, corresponds to a production rule of R, e.g., plus corresponds to the rule Z ::= Z + Z. Given a datatype value v, we write **to_term**(v) to denote the term that v represents, e.g., **to_term**(plus(a, b)) is the term a + b.
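The correspondence between datatype values and terms can be sketched as follows. The class `Value`, the table `TEMPLATES`, and the helper `enumerate_z` are hypothetical names chosen for this illustration; terms are rendered as text, with parentheses added around compound terms for readability.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Value:
    """A datatype value: a constructor applied to child values."""
    ctor: str
    args: tuple = ()


# Text rendering for each constructor of the datatypes in equations (3) and (4).
TEMPLATES = {
    "zero": "0", "one": "1", "a": "a", "b": "b",
    "plus": "({} + {})", "minus": "({} - {})", "ite": "ite({}, {}, {})",
    "geq": "({} >= {})", "eq": "({} = {})", "not": "(not {})",
    "and": "({} and {})",
}


def to_term(v):
    """Map a datatype value back to the term it represents (as text)."""
    return TEMPLATES[v.ctor].format(*(to_term(c) for c in v.args))


def enumerate_z(depth):
    """All integer-sorted values up to a nesting depth, using only plus and minus."""
    leaves = [Value(c) for c in ("zero", "one", "a", "b")]
    if depth == 0:
        return leaves
    smaller = enumerate_z(depth - 1)
    return leaves + [Value(op, (left, right))
                     for op in ("plus", "minus")
                     for left, right in product(smaller, repeat=2)]


print(to_term(Value("plus", (Value("a"), Value("b")))))
print(len(enumerate_z(1)))
```

The enumeration by depth is only a naive stand-in; it does not reflect how a datatype solver generates values.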

In previous work [22, 24], a smart enumerative approach for syntax-guided synthesis was presented and implemented in CVC4. In that work, the generation of terms is based on finding solutions for an evolving set of constraints in an extension of the quantifier-free fragment of algebraic datatypes, for which some SMT solvers have dedicated decision procedures [3, 23]. In the remainder of this paper, we write T<sub>D</sub> to denote the theory of datatypes over a signature Σ<sub>D</sub> of constructor and selector symbols. The signature Σ<sub>D</sub> includes (parametric) datatype sorts that are interpreted as the universe of a term algebra over the constructors. Selectors are interpreted as functions that extract the immediate subterms of a constructor term.

In our setting, datatype constraints are used to express syntactic restrictions on the terms in the original theory. For instance, in case of the example theory and corresponding datatypes Z and B defined above, we can write a datatype constraint that is falsified by all terms of the form plus(zero, t) where t is a constructor term of sort Z. This corresponds to ruling out terms of the form (0 + s) in the original theory where s is a term of sort Int. In more detail, for a datatype term d, we write is<sub>C</sub>(d) to denote the discriminator predicate, which is satisfied exactly when d is interpreted as a datatype value whose top constructor is C. We write sel<sub>σ,n</sub>(d) to denote a shared selector [28] applied to d, interpreted as the nth child of d with sort σ if one exists, and as an arbitrary element of σ otherwise. These symbols are used for constructing blocking constraints. For example, we can write ¬is<sub>plus</sub>(d) ∨ ¬is<sub>zero</sub>(sel<sub>Z,1</sub>(d)) to state the constraint above that d cannot be interpreted as any datatype value corresponding to an Int term of the form (0 + ...). In the context of syntax-guided synthesis, a constraint like this is added, for instance, to filter out redundant terms (like 0 + ...) or terms already known to falsify the synthesis conjecture.
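The blocking constraint from this paragraph can be evaluated on concrete datatype values as in the following sketch. The names `Value`, `is_ctor`, and `sel` are hypothetical, and the selector is simplified: it ignores sorts and returns a fixed default value where the text allows an arbitrary element.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Value:
    """A datatype value: a constructor applied to child values."""
    ctor: str
    args: tuple = ()


def is_ctor(name, d):
    """Discriminator: true exactly when the top constructor of d is `name`."""
    return d.ctor == name


def sel(n, d, default=Value("zero")):
    """Simplified shared selector: the n-th child of d (1-based), else a default."""
    return d.args[n - 1] if n <= len(d.args) else default


def blocking_constraint(d):
    """not is_plus(d) or not is_zero(sel_1(d)): rules out values plus(zero, t)."""
    return not is_ctor("plus", d) or not is_ctor("zero", sel(1, d))


blocked = Value("plus", (Value("zero"), Value("a")))   # represents 0 + a
allowed = Value("plus", (Value("a"), Value("zero")))   # represents a + 0
print(blocking_constraint(blocked), blocking_constraint(allowed))
```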

Our approach for syntax-guided instantiation relies on a notion of evaluation variables. A related, more general, notion of evaluation functions was used in the context of syntax-guided synthesis (see Section 2 of [22] for details). Let d be a term of a datatype sort encoding a grammar over terms of sort σ. We write e<sub>d</sub> to denote a free constant of sort σ, which we call the evaluation variable for d. We use evaluation variables to determine which terms to use in instantiations of quantified formulas. The algorithm given in the following section will add constraints that force the interpretation of e<sub>d</sub> to be equal to **to_term**(d<sup>I</sup>) in interpretations I. A simple example of such a constraint is is<sub>a</sub>(d) ⇒ e<sub>d</sub> ≈ a, stating that the evaluation variable e<sub>d</sub> for d is equal to the free constant a of integer type when d is interpreted as the datatype value a.

#### **3 SyGuS Quantifier Instantiation (SyQI)**

Our new SyGuS-based instantiation approach combines counterexample-guided quantifier instantiation (CEGQI) with smart enumerative SyGuS techniques to synthesize terms for quantifier instantiation. In essence, it is an algorithm that tries to synthesize a term t for a variable x in a given formula ∀x. P[x] such that ¬P[t] holds. For synthesis purposes, each quantified variable is associated with a SyGuS grammar based on the sort of the variable. For example, our algorithm uses a bit-vector-specific grammar to synthesize bit-vector terms as possible instantiations of quantified variables of bit-vector sort. Our SyGuS solver suggests instantiations based on such grammars and an evolving set of constraints on the instance term. The main advantage of this instantiation approach is that it does not require theory-specific quantifier instantiation algorithms. Its only theory-specific aspects are the construction of the grammar for each theory sort and the satisfiability checks performed on the generated instances.

Algorithm 1 shows the two main procedures **syqi** and **select_lemmas**<sub>L</sub> of our SyGuS instantiation approach. To simplify the exposition, we describe the restricted case where the quantified input formulas are all universal. Our implementation in CVC4, however, applies to the general case through a lazy conversion to DNF and resolution of quantifier alternations.

Procedure **syqi** takes as argument a set {Q<sub>1</sub>,...,Q<sub>n</sub>} of universal (quantified) T-formulas and a set G of ground T-formulas. As an initial step, and prior to solving the problem, we generate a lemma for each quantified formula Q<sub>i</sub> as part of our counterexample-guided quantifier instantiation approach (lines 2-5). We first create a fresh datatype constant d<sub>x</sub> of sort **grammar**<sub>S</sub>(x) for each variable x ∈ *x* in each input formula ∀*x*. P[*x*]. The datatype sort **grammar**<sub>S</sub>(x) is constructed from a SyGuS grammar determined by the sort of variable x. The language generated by the grammar includes ground terms from Q<sub>i</sub> and G of the same sort. These terms are chosen following a selection strategy S, which we describe in Section 3.1. Apart from running **check**, used as a black box, **grammar**<sub>S</sub> implements the only theory-specific handling of our procedure. Finally, we add to G a lemma of the form l<sub>i</sub> ⇒ ¬P[**e**<sub>d<sub>*x*</sub></sub>] for each quantified formula, where l<sub>i</sub> is a fresh Boolean constant (the counterexample literal for Q<sub>i</sub>). Thanks to l<sub>i</sub> being fresh, this preserves the satisfiability of G. The notation **e**<sub>d<sub>*x*</sub></sub> is a shorthand for (e<sub>d<sub>x<sub>1</sub></sub></sub>, ..., e<sub>d<sub>x<sub>m</sub></sub></sub>), the tuple of evaluation variables for each d<sub>x</sub> of x ∈ *x*. The purpose of a counterexample lemma is twofold. First, it indicates whether a quantified formula Q<sub>i</sub> is active (l<sub>i</sub> assigned to true) or inactive (l<sub>i</sub> assigned to false). Second, it focuses on finding counterexamples that falsify the body of Q<sub>i</sub>.

The main loop of procedure **syqi** is provided in lines 6-11. Each iteration starts with a quantifier-free satisfiability check (performed by procedure **check** on line 7) on the current set of ground formulas G in the combined theory T ∪ T<sub>D</sub>. If G is unsatisfiable, procedure **syqi** returns unsat. If G is satisfiable, the procedure further checks whether it can find a counterexample for any of the quantified formulas Q<sub>1</sub>,...,Q<sub>n</sub>, which is done by checking the satisfiability of G ∧ (l<sub>1</sub> ∨ ... ∨ l<sub>n</sub>). If the check returns unsat then no more counterexamples can be found; the algorithm concludes that the input set is satisfiable and returns sat. The reason is that, in this case, the set G is satisfiable and entails each input formula, as proven later in this section. If the second call to **check** (line 8) returns sat, it additionally returns (a finite representation of) a model I for the current set of ground formulas G. Since I satisfies l<sub>1</sub> ∨ ... ∨ l<sub>n</sub>, it does not satisfy at least one quantified formula in Q<sub>1</sub>,...,Q<sub>n</sub>.<sup>4</sup> For each active quantified formula in I, we generate new lemmas via procedure **select_lemmas**<sub>L</sub> (lines 10-11), and repeat the main loop of the algorithm. Note that the second satisfiability check can be avoided by employing a special decision heuristic for counterexample literals l<sub>i</sub> in the SAT solver. The decision heuristic will always assign a counterexample literal l<sub>i</sub> to true on a decision. Consequently, l<sub>i</sub> can only be assigned to false in a candidate interpretation I if ¬l<sub>i</sub> is entailed by the set of ground formulas G.
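The shape of this loop can be illustrated with a deliberately simplified sketch. It is not the procedure of Algorithm 1: brute-force search over a toy domain (integers modulo 4, one free constant a) replaces the SMT check, explicit enumeration of the grammar Z ::= 0 | 1 | a | Z + Z replaces the datatype solver, and the counterexample literals and evaluation variables are collapsed into a direct search for a falsifying term. All names are our own assumptions.

```python
from itertools import product

MOD = 4  # toy domain: integers modulo 4


def terms(depth):
    """(text, evaluator) pairs for Z ::= 0 | 1 | a | Z + Z up to a nesting depth."""
    leaves = [("0", lambda a: 0), ("1", lambda a: 1), ("a", lambda a: a)]
    if depth == 0:
        return leaves
    smaller = terms(depth - 1)
    sums = [(f"({s} + {t})", lambda a, f=f, g=g: (f(a) + g(a)) % MOD)
            for (s, f), (t, g) in product(smaller, repeat=2)]
    return leaves + sums


def check(ground):
    """Brute-force stand-in for the ground check; returns a value for a or None."""
    for a in range(MOD):
        if all(g(a) for g in ground):
            return a
    return None


def syqi(body, ground, depth=2):
    """Decide `ground` together with (forall x. body(x, a)) over the toy domain."""
    ground = list(ground)
    candidates = terms(depth)  # depth 2 reaches every value 0..3 via 0, 1, 1+1, (1+1)+1
    while True:
        model = check(ground)
        if model is None:
            return "unsat"
        # Look for an instantiation term that falsifies the body in this model.
        witness = next((t for _, t in candidates
                        if not body(t(model), model)), None)
        if witness is None:
            return "sat"  # no counterexample: the model satisfies the quantified formula
        # Instantiation lemma body[t]; it is false in, and so excludes, this model.
        ground.append(lambda a, t=witness: body(t(a), a))


def shift_is_identity(x, a):
    """Body of the toy formula forall x. x + a = x (modulo 4)."""
    return (x + a) % MOD == x


print(syqi(shift_is_identity, [lambda a: a != 0]))
print(syqi(shift_is_identity, []))
```

In this toy setting the loop always terminates, because every added lemma excludes the current value of a and there are only four such values. The first call returns unsat after one instantiation with the term 0, and the second returns sat with a = 0.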

Procedure **select_lemmas**<sub>L</sub> takes a formula ∀*x*. P[*x*] and a model I as arguments and generates a set of lemmas based on I and selection strategy L. The procedure maintains the invariant of always returning a set of lemmas L where L \ G is non-empty. This set L includes a single instantiation lemma (of the form P[*t*]) and an evaluation unfolding lemma (see below) for each variable x ∈ *x*. The returned lemmas are generated based on one of three lemma selection strategies: priority-inst, priority-eval, and interleave. Strategy interleave selects both the instantiation lemma and a set of evaluation unfolding lemmas at the same time. Strategies priority-inst and priority-eval give priority to instantiation lemmas and evaluation unfolding lemmas, respectively; i.e., strategy priority-inst selects the instantiation lemma and only selects evaluation unfolding lemmas if the instantiation lemma was already in G. Analogously, priority-eval gives priority to evaluation unfolding lemmas.
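The three strategies can be sketched as a small selection function. Lemmas are represented as strings and G as a set; the function name and signature are hypothetical. For priority-eval we assume that the instantiation lemma is selected only when all evaluation unfolding lemmas are already in G, which is our reading of "analogously".

```python
def select_lemmas(strategy, inst_lemma, eval_lemmas, ground):
    """Choose which lemmas to return; `ground` holds the formulas already in G."""
    inst_is_new = inst_lemma not in ground
    some_eval_is_new = any(lemma not in ground for lemma in eval_lemmas)
    if strategy == "interleave":
        return [inst_lemma] + list(eval_lemmas)
    if strategy == "priority-inst":
        return [inst_lemma] if inst_is_new else list(eval_lemmas)
    if strategy == "priority-eval":
        return list(eval_lemmas) if some_eval_is_new else [inst_lemma]
    raise ValueError(f"unknown strategy: {strategy}")


G = {"P[a + b]"}
print(select_lemmas("priority-inst", "P[a + b]", ["unfold(d_x)"], G))
print(select_lemmas("priority-inst", "P[0]", ["unfold(d_x)"], G))
```

The sketch does not enforce the invariant that the returned set contains a lemma outside G; it only shows how the priorities differ.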

<sup>4</sup> Note that this does not mean the quantified formula is unsatisfiable, only that it is not satisfied in I.

The various lemmas are constructed as follows. For each variable x ∈ *x* we use the model value d<sub>x</sub><sup>I</sup> of datatype constant d<sub>x</sub> to construct the corresponding term **to_term**(d<sub>x</sub><sup>I</sup>) in the theory of variable x (line 15). The constructed term corresponds to a term synthesized by the SyGuS extension of our datatypes solver based on the grammar specified for x. To ensure that d<sub>x</sub> evaluates to the same values as term **to_term**(d<sub>x</sub><sup>I</sup>) under model value d<sub>x</sub><sup>I</sup>, we generate the evaluation unfolding lemma **explain**(d<sub>x</sub> ≈ d<sub>x</sub><sup>I</sup>) ⇒ e<sub>d<sub>x</sub></sub> ≈ **to_term**(d<sub>x</sub><sup>I</sup>). The explanation for the model value d<sub>x</sub><sup>I</sup> is expressed in terms of discriminator predicates. For example, if value d<sub>x</sub><sup>I</sup> represents term a + b, the procedure generates lemma is<sub>plus</sub>(d<sub>x</sub>) ∧ is<sub>a</sub>(sel<sub>Z,1</sub>(d<sub>x</sub>)) ∧ is<sub>b</sub>(sel<sub>Z,2</sub>(d<sub>x</sub>)) ⇒ e<sub>d<sub>x</sub></sub> = a + b. As a last step, **select_lemmas**<sub>L</sub> selects a non-empty subset of the generated instantiation lemma P[t<sub>1</sub>,...,t<sub>p</sub>] (where each t<sub>i</sub> is **to_term**(d<sub>x<sub>i</sub></sub><sup>I</sup>)) and the evaluation unfolding lemmas L according to the lemma selection strategy L.
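The construction of an evaluation unfolding lemma from a model value can be sketched as follows. Lemmas are produced as plain text, and the names `Value`, `explain`, and `unfolding_lemma` are hypothetical; only the constructors for leaves, plus, and minus are handled, and every selector is written with sort Z.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Value:
    """A datatype value: a constructor applied to child values."""
    ctor: str
    args: tuple = ()


INFIX = {"plus": "+", "minus": "-"}
LEAVES = {"zero": "0", "one": "1"}


def to_term(v, top=True):
    """Text of the term a datatype value stands for (leaves, plus, minus only)."""
    if v.ctor in INFIX:
        left, right = (to_term(c, top=False) for c in v.args)
        text = f"{left} {INFIX[v.ctor]} {right}"
        return text if top else f"({text})"
    return LEAVES.get(v.ctor, v.ctor)


def explain(d, v):
    """Discriminator literals that pin the datatype term d to the value v."""
    literals = [f"is_{v.ctor}({d})"]
    for n, child in enumerate(v.args, start=1):
        literals += explain(f"sel_Z,{n}({d})", child)
    return literals


def unfolding_lemma(d, v):
    """Render explain(d = v) => e_d = to_term(v) as text."""
    return f"{' & '.join(explain(d, v))} => e_{d} = {to_term(v)}"


print(unfolding_lemma("d_x", Value("plus", (Value("a"), Value("b")))))
```

For the value plus(a, b) this prints the textual counterpart of the lemma given in the paragraph above.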

We now discuss the correctness properties of our approach. In the following, we say a grammar R for sort σ is complete if, for all interpretations I and values v of sort σ, it generates at least one term t such that t<sup>I</sup> = v. Note that we only consider complete grammars in this paper. We say a lemma selection strategy L is fair with respect to a set of formulas G if it returns a set of lemmas that contains at least one lemma inequivalent to each formula in G whenever such a lemma exists.

**Theorem 1.** Let T be a theory with signature Σ, let F be a set of universal formulas {Q<sub>1</sub>,...,Q<sub>n</sub>}, and let G<sub>0</sub> be a set of quantifier-free formulas. If all grammars constructed by the calls to *grammar*<sub>S</sub> in *syqi* are complete and the selection strategy L used for *select lemmas*<sub>L</sub> is fair, then the following statements hold:


Conceptually, the proof of refutational soundness relies on the fact that all lemmas added to G are entailed by the input or maintain equisatisfiability with respect to the input. The proof of model soundness relies on the fact that when G collectively entails the negation of (all) quantified formulas, then the current model I for G must be a model for all quantified formulas. Procedure **syqi** is not terminating in general. However, the progress property guarantees that the algorithm does not get stuck in a single state and keeps making progress towards refining the set of possible models by ruling out at least one candidate model at each iteration of the procedure's main loop.

Proof. For brevity, we show these statements for the case of n = 1 and where Q<sub>1</sub> is ∀*x*. P[*x*]; the proof can be easily lifted to n > 1. When **syqi**(F, G<sub>0</sub>) terminates, the internal set G is the union of:


To show (1), assume that ϕ is satisfied by some Σ-interpretation J, where without loss of generality we assume that l<sup>J</sup> is false. Let I be a Σ∪Σ<sub>D</sub>-interpretation that extends J such that for each evaluation variable e<sub>d</sub>, the interpretation of d in I is such that **to term**(d<sup>I</sup>)<sup>I</sup> = e<sub>d</sub><sup>I</sup>. Such a value exists since our grammars are complete by assumption. We show that I satisfies each formula ψ in G. If ψ ∈ G<sub>0</sub>, then this holds since J satisfies ϕ, and hence, by extension, I does as well. If ψ ∈ G<sub>cex</sub>, then ψ is satisfied by I since it interprets l<sub>i</sub> as false. If ψ ∈ G<sub>inst</sub> is an instantiation lemma of some Q<sub>i</sub>, then it is satisfied by I since J also satisfies Q<sub>i</sub>. If ψ ∈ G<sub>ev</sub> is an evaluation lemma, this is satisfied by our construction of d<sup>I</sup>. Thus, if ϕ is T-satisfiable, then G must be (T ∪ T<sub>D</sub>)-satisfiable. Hence, since **syqi**(F, G<sub>0</sub>) returns unsat only when G is (T ∪ T<sub>D</sub>)-unsatisfiable, a result of unsat means that F ∪ G<sub>0</sub> is T-unsatisfiable as well.

To show (2), if **syqi**(F, G<sub>0</sub>) returns sat, then the set G is satisfied by some Σ∪Σ<sub>D</sub>-interpretation I and G ∪ {l<sub>1</sub>} is unsatisfiable. Let J be the Σ-interpretation that interprets all symbols in Σ the same as in I. Since G ∪ {l<sub>1</sub>} is unsatisfiable, we have that G<sub>0</sub> ∪ G<sub>inst</sub> ∪ G<sub>ev</sub> ∪ {¬P[**e**<sub>*d*<sub>*x*</sub></sub>]} is (T ∪ T<sub>D</sub>)-unsatisfiable. Since all Σ-interpretations can be lifted to a Σ ∪ Σ<sub>D</sub>-interpretation satisfying G<sub>ev</sub>, it must also be the case that G<sub>0</sub> ∪ G<sub>inst</sub> ∪ {¬P[**e**<sub>*d*<sub>*x*</sub></sub>]} is T-unsatisfiable. Hence, all models of G<sub>0</sub> ∪ G<sub>inst</sub> must make P[**e**<sub>*d*<sub>*x*</sub></sub>] true. Since **e**<sub>*d*<sub>*x*</sub></sub> does not occur in G<sub>0</sub> ∪ G<sub>inst</sub>, this implies that all models of G<sub>0</sub> ∪ G<sub>inst</sub> satisfy ∀*x*. P[*x*]. Since G<sub>0</sub> ∪ G<sub>inst</sub> ⊆ G and I satisfies G, we have that J satisfies {∀*x*. P[*x*]} ∪ G.

To show (3), assume ad absurdum that G is satisfied by a (T ∪ T<sub>D</sub>)-interpretation I where **to term**(*d*<sub>*x*</sub><sup>I</sup>) = *t* and Q<sub>1</sub> is active in I. Also assume that G contains the evaluation unfolding lemmas for *d*<sub>*x*</sub><sup>I</sup> and the instantiation lemma P[*t*]. Due to the former, we have that **e**<sub>*d*<sub>*x*</sub></sub><sup>I</sup> = *t*<sup>I</sup>. Since Q<sub>1</sub> is active in I, I satisfies ¬P[**e**<sub>*d*<sub>*x*</sub></sub>]. However, P[*t*] is also satisfied by I, a contradiction. Thus, at least one of the lemmas returned by **select lemmas**<sub>L</sub> for Q<sub>1</sub> must be inequivalent to the lemmas in G, due to our assumption that L is a fair selection strategy.

#### **3.1 Grammar Construction**

For quantifier instantiation, we focus on the theories of fixed-size bit-vectors, floating-point numbers, integers, and reals as defined by the SMT-LIB 2 standard [4]. The signature of the theory of fixed-size bit-vectors includes a unique sort for each positive bit-vector width n, denoted here as BV[n]. The signature of the theory of floating-point numbers includes a rounding-mode sort RM and a unique floating-point sort for each combination of positive exponent width e and significand width s, denoted here as FP[e,s]. The theories of Integers and Reals include the integer sort Int and the real sort Real, respectively. For each of these sorts we define a SyGuS grammar that includes the following operators and constants.

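$$\begin{aligned}
R_{\mathsf{BV}} &: \{\sim,\ -,\ \&,\ \mid,\ \oplus,\ +,\ \cdot,\ \div,\ \div_s,\ \bmod,\ \bmod_s,\ \ll,\ \gg,\ \gg_a,\ 0,\ 1,\ \mathsf{ones},\ \mathsf{smin},\ \mathsf{smax}\} \\
R_{\mathsf{FP}} &: \{-,\ \mathsf{abs},\ \mathsf{rem},\ \surd,\ \mathsf{rti},\ +,\ \cdot,\ \div,\ \mathsf{fma},\ \mathsf{NaN},\ \pm\infty,\ \pm 0,\ \pm\mathsf{min}_s,\ \pm\mathsf{max}_s,\ \pm\mathsf{min}_n,\ \pm\mathsf{max}_n\} \\
R_{\mathsf{RM}} &: \{\mathsf{RNA},\ \mathsf{RNE},\ \mathsf{RTN},\ \mathsf{RTP},\ \mathsf{RTZ}\} \\
R_{\mathsf{Int}} &: \{+,\ -,\ 0,\ 1\} \\
R_{\mathsf{Real}} &: \{+,\ -,\ \div,\ 0,\ 1\}
\end{aligned}$$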


**Table 1.** Set of operators considered in SyGuS grammars.

The (non-constant) operators and their SMT-LIB names and types are listed in Table 1. Note that we further restrict the division operator ÷ of sort Real to division by value, i.e., we do not allow division by an arbitrary term of sort Real. We also add a set of special values of the corresponding sort to each default grammar. We represent bit-vector values of sort BV[n] as bit-strings of length n, where the left-most bit is the most significant bit. For floating-point values of sort FP[e,s], we use bit-strings where the left-most bit indicates the sign, the following e bits represent the exponent, and the remaining bits the significand. For the theory of fixed-size bit-vectors, we use smax[n] or smin[n] for the maximum or minimum signed value of width n, e.g., smax[4] = 0111 and smin[4] = 1000, and ones[n] for the maximum unsigned value, e.g., ones[4] = 1111. For the theory of floating-point numbers, we use ±0 for positive and negative zero, ±∞ for positive and negative infinity, and NaN for not a number, e.g., −0[3,5] = 10000000 and +∞[3,5] = 01110000. We further use ±min<sub>s</sub> for the positive and negative smallest subnormal, ±max<sub>s</sub> for the positive and negative largest subnormal, ±min<sub>n</sub> for the positive and negative smallest normal, and ±max<sub>n</sub> for the positive and negative largest normal, e.g., −max<sub>s</sub>[3,5] = 10001111 and +min<sub>n</sub>[3,5] = 00010000. In the definition of grammar R<sub>FP</sub> above, we use symbol ± to indicate that both the positive and negative variant of a special value is included in the grammar.
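The bit-string encodings of these special values can be reproduced with a short sketch. It assumes, as in SMT-LIB and consistent with the 8-bit examples above, that the significand width s counts the hidden bit, so only s − 1 significand bits are stored. The function names and dictionary keys are hypothetical.

```python
def bv_special_values(n: int) -> dict:
    """Special bit-vector values of width n as bit-strings (MSB first)."""
    return {
        "smax": "0" + "1" * (n - 1),
        "smin": "1" + "0" * (n - 1),
        "ones": "1" * n,
    }


def fp_special_values(e: int, s: int) -> dict:
    """Special values of FP[e,s] as sign | exponent | stored significand.

    Assumes e >= 2 and s >= 2; the hidden bit is not stored.
    """
    m = s - 1  # number of stored significand bits
    values = {}
    for sign, bit in (("+", "0"), ("-", "1")):
        values[sign + "0"] = bit + "0" * e + "0" * m
        values[sign + "inf"] = bit + "1" * e + "0" * m
        values[sign + "min_s"] = bit + "0" * e + "0" * (m - 1) + "1"
        values[sign + "max_s"] = bit + "0" * e + "1" * m
        values[sign + "min_n"] = bit + "0" * (e - 1) + "1" + "0" * m
        values[sign + "max_n"] = bit + "1" * (e - 1) + "0" + "1" * m
    return values


fp = fp_special_values(3, 5)
print(fp["-0"], fp["+inf"], fp["-max_s"], fp["+min_n"])
```

NaN is left out of the sketch because it has more than one bit-level encoding.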

We extend the above set of default grammars (**grammar**<sub>S</sub> in Algorithm 1) with ground terms that occur in an input set {Q<sub>1</sub>,...,Q<sub>n</sub>} ∪ G<sub>0</sub> based on the sort of variable x ∈ *x* in Q<sub>i</sub> = ∀*x*. P[*x*] and a term selection strategy. This strategy is based on the following two factors. We consider three modes for the scope of ground terms: (1) ground terms that occur in quantified formula Q<sub>i</sub> (strategy in), (2) ground terms that occur in the set of ground formulas G (strategy out), and (3) the union of (1) and (2) (strategy both). We consider three modes for the size of ground terms, defined as the number of subterms a term consists of: (a) terms of minimal size, i.e., constants that occur in a term (strategy min), (b) terms of maximal size (strategy max), and (c) the union of (a) and (b) (strategy both). For example, for a ground term a + b · c, strategy min will select a, b, c, max will select a + b · c, and both will select a, b, c, a + b · c. The scope and size modes can be combined freely, giving 3 · 3 = 9 possible term selection strategies.
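The scope and size modes can be sketched on a toy term representation. The encoding is an assumption: constants are strings and compound ground terms are tuples such as `("+", "a", ("*", "b", "c"))`; the given top-level ground terms are treated as the terms of maximal size. All function names are hypothetical.

```python
from itertools import product


def constants(term):
    """Collect the constants (leaves) that occur in a ground term."""
    if isinstance(term, str):
        return {term}
    result = set()
    for arg in term[1:]:  # term[0] is the operator symbol
        result |= constants(arg)
    return result


def select_by_size(terms, size):
    """Apply the size mode (min, max, or both) to a list of ground terms."""
    minimal = set()
    for t in terms:
        minimal |= constants(t)
    maximal = set(terms)
    return {"min": minimal, "max": maximal, "both": minimal | maximal}[size]


def select_terms(in_terms, out_terms, scope, size):
    """Apply the scope mode (in, out, or both), then the size mode."""
    pool = {"in": in_terms, "out": out_terms, "both": in_terms + out_terms}[scope]
    return select_by_size(pool, size)


# The nine combined strategies, named as in the experiments, e.g. "both-both".
STRATEGIES = [f"{scope}-{size}"
              for scope, size in product(("in", "out", "both"), ("min", "max", "both"))]

term = ("+", "a", ("*", "b", "c"))  # the ground term a + b * c
print(select_terms([term], [], "in", "min"))
```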

Example 1. Let Q = ∀x. x · x ≉ a · a + b · b + 2 · a · b where x, a, b have integer type and suppose we run **syqi**({Q}, ∅). The algorithm first constructs the grammar **grammar**<sub>S</sub>(x) for x, where we assume term selection strategy S with scope in and size min, which considers ground terms that occur in Q and are of minimal size (2, a, and b). This grammar is encoded as the following datatype Z:

$$\mathcal{Z} = \mathsf{zero} \mid \mathsf{one} \mid \mathsf{plus}(\mathcal{Z}, \mathcal{Z}) \mid \mathsf{minus}(\mathcal{Z}, \mathcal{Z}) \mid \mathsf{two} \mid \mathsf{a} \mid \mathsf{b}$$

The algorithm introduces a fresh datatype variable d<sub>x</sub> of type Z, a fresh variable e<sub>d<sub>x</sub></sub> of integer type, and adds l ⇒ e<sub>d<sub>x</sub></sub> · e<sub>d<sub>x</sub></sub> ≈ a · a + b · b + 2 · a · b to the internal set G of ground formulas, where l is a fresh Boolean variable. In the first iteration of the loop, we have that G (and G ∪ {l}) are satisfiable. Hence, the algorithm calls **select lemmas**<sub>L</sub> on Q and a model I for G; assume that d<sub>x</sub><sup>I</sup> = zero and e<sub>d<sub>x</sub></sub><sup>I</sup> = a<sup>I</sup> = b<sup>I</sup> = 0. Based on the lemma selection strategy, we may choose to add the instantiation lemma 0 · 0 ≉ a · a + b · b + 2 · a · b, or the evaluation lemma iszero(d<sub>x</sub>) ⇒ e<sub>d<sub>x</sub></sub> ≈ 0, or both lemmas to G. Assuming both lemmas are added to G, the next iteration of the loop will consider a new model I where d<sub>x</sub><sup>I</sup> ≠ zero and e<sub>d<sub>x</sub></sub><sup>I</sup> ≠ 0. The algorithm will continue finding models with new values for d<sub>x</sub>, until it finds a model I where d<sub>x</sub><sup>I</sup> = plus(a, b). At this point the instantiation lemma (a + b) · (a + b) ≉ a · a + b · b + 2 · a · b will be added to G, which is equivalent to false, and **syqi** will terminate with unsat.
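The outcome of Example 1 can be checked with a brute-force sketch. This is not the model-driven search of **syqi**: it enumerates the terms of datatype Z up to depth one in a fixed, arbitrary order and tests on sampled integer values whether t · t equals a · a + b · b + 2 · a · b, i.e., whether the instantiation with t contradicts Q. The sampling range and all names are assumptions, and a sample-based check is evidence rather than a proof.

```python
from itertools import product

LEAVES = ["zero", "one", "two", "a", "b"]


def terms_up_to_depth_one():
    """Enumerate datatype values of Z: all leaves, then plus/minus over leaves."""
    for leaf in LEAVES:
        yield (leaf,)
    for op in ("plus", "minus"):
        for left, right in product(LEAVES, repeat=2):
            yield (op, (left,), (right,))


def evaluate(t, a, b):
    """Evaluate a datatype value as an integer term under values for a and b."""
    head = t[0]
    if head == "plus":
        return evaluate(t[1], a, b) + evaluate(t[2], a, b)
    if head == "minus":
        return evaluate(t[1], a, b) - evaluate(t[2], a, b)
    return {"zero": 0, "one": 1, "two": 2, "a": a, "b": b}[head]


def contradicts_q(t, samples):
    """True if t * t equals a*a + b*b + 2*a*b on every sampled pair (a, b)."""
    return all(evaluate(t, a, b) ** 2 == a * a + b * b + 2 * a * b
               for a, b in samples)


SAMPLES = list(product(range(-3, 4), repeat=2))
first = next(t for t in terms_up_to_depth_one() if contradicts_q(t, SAMPLES))
print(first)
```

With this enumeration order, the first term found is the one encoding a + b, matching the instantiation that closes the example.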

#### **3.2 Implementation Details**

We implemented syntax-guided quantifier instantiation in the CVC4 [5] solver, which has support for a wide range of background theories, covering all those in the SMT-LIB standard library [2]. CVC4 is based on the CDCL(T) (formerly DPLL(T)) framework [19]. This framework integrates a propositional SAT solver, which attempts to find a Boolean assignment that propositionally satisfies the input formula, with one or more specialized theory solvers, which monitor the assignments made by the SAT solver to theory literals and flag a conflict if the assignments are ever inconsistent in their theory.

Our SyQI technique is implemented as a module of the subsolver of CVC4 that handles quantified formulas. We leverage CVC4's support for smart enumerative SyGuS as described in Reynolds et al. [22]. Specifically, the **check** method in line 7 of Algorithm 1 involves calling the combination of quantifier-free theory solvers, which includes an extension of the theory of datatypes described in the following.

Symmetry Breaking for Smart Enumerative Synthesis. As described in previous work [22, 24], CVC4 uses advanced techniques for symmetry breaking for the datatypes over which context-free grammars are embedded. The quantifier-free datatype theory solver in CVC4 is extended to issue symmetry blocking clauses based on reasoning about such datatypes, so that the models we generate for a datatype variable d are such that **to term**(d) is unique with respect to rewriting. For example, the terms a + b and b + a are equivalent, and in CVC4, one will be rewritten to the other. Thus, we know that we only have to consider one variant, e.g., a + b. Hence, the extended datatypes solver may issue the blocking clause ¬isplus(d) ∨ ¬isb(sel<sub>Z,1</sub>(d)) ∨ ¬isa(sel<sub>Z,2</sub>(d)), effectively stating that the term associated with d should not be b + a. This technique is highly valuable for syntax-guided synthesis, since it reduces the set of terms considered in the search for candidate solutions. In the context of this work, these techniques are of great importance, since they guarantee that our algorithm does not consider multiple instantiations over tuples of pairwise equivalent terms.
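The idea behind such blocking can be illustrated with a small sketch that keeps only one representative per rewriting class. The rewriter here is a stand-in assumption: it only orders the arguments of the commutative constructor plus, whereas the rewriter in CVC4 covers far more cases. The names `canonical` and `is_blocked` are hypothetical.

```python
def canonical(term):
    """Rewrite a term to a canonical form by ordering arguments of plus."""
    if isinstance(term, str):
        return term
    op, left, right = term
    left, right = canonical(left), canonical(right)
    # plus is commutative, so both argument orders denote equivalent terms;
    # repr gives an arbitrary but fixed total order on terms.
    if op == "plus" and repr(right) < repr(left):
        left, right = right, left
    return (op, left, right)


def is_blocked(term):
    """A term is blocked if it is not the canonical member of its class."""
    return canonical(term) != term


print(is_blocked(("plus", "b", "a")), is_blocked(("plus", "a", "b")))
```

In this sketch, the term b + a is blocked because it rewrites to a + b, which mirrors the effect of the blocking clause above; minus is left untouched since it is not commutative.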

Quantified Formulas within Boolean Structure and Nested Quantification. As mentioned earlier, while not shown in Algorithm 1, our approach uses standard techniques for handling general quantified formulas, in particular with quantifiers that occur below Boolean connectives. In the context of CDCL(T), for each quantified formula Q<sub>i</sub> of the form ∀*x*. P[*x*], the propositional model of our Boolean structure may either assign Q<sub>i</sub> to true or false, or leave it unassigned. Quantified formulas that are assigned to false are Skolemized, i.e., a lemma of the form ¬Q<sub>i</sub> ⇒ ¬P[*k*], where *k* are fresh constants, is returned to the SAT solver. Quantified formulas that are unassigned are ignored. Quantified formulas that are assigned to true are either active or inactive based on the value assigned to their counterexample literals. Those that are active are processed via **select lemmas**<sub>L</sub>. In practice, instantiation lemmas are guarded so that Q<sub>i</sub> ⇒ P[*t*] is returned to the SAT solver, meaning that the conclusion only holds when Q<sub>i</sub> is assigned to true. Furthermore, each Q<sub>i</sub> may have nested quantification, that is, the formula P in the counterexample lemma l<sub>i</sub> ⇒ ¬P[**e**<sub>*d*<sub>*x*</sub></sub>] may contain quantified subformulas. Those quantified formulas are then processed by our full algorithm in the same way as quantified formulas from the input.

#### **4 Experiments**

We implemented our approach in the SMT solver CVC4 [5]. We provide here an extensive evaluation of the techniques and strategies described in Section 3. We first evaluate term and lemma selection strategies for grammar construction, and then compare the performance of our best configuration against Z3 [16], the only state-of-the-art SMT solver besides CVC4 that supports all the logics supported by our implementation.

We performed all experiments on a cluster with Intel Xeon E5-2620 CPUs at 2.1 GHz and 128 GB of memory. We used a time limit of 300 seconds and a memory limit of 8 GB for each solver/benchmark pair, and count memory-outs as timeouts. We evaluate here all configurations on all quantified logics in SMT-LIB [2] that do not contain uninterpreted functions (UF). As an exception, we include the logic UFBV, since the benchmarks in this logic rely almost entirely on BV reasoning only. We generally exclude logics with UF since for such logics counterexample-guided techniques, as in our approach, are not expected to be more effective than heuristic instantiation techniques such as E-matching, which we confirmed in a preliminary evaluation. Overall, we include logics BV (bit-vectors), FP (floating-point arithmetic), LIA (linear integer arithmetic), LRA (linear real arithmetic), NIA (non-linear integer arithmetic), NRA (non-linear real arithmetic), and their combinations BVFP, BVFPLRA, FPLRA, and UFBV. In total, our benchmark set consists of 15,746 benchmarks.

**Table 2.** Selection strategies on considered logics (15,746 benchmarks).

**Term Selection for Grammar Construction.** As a first experiment, we determine the best combination of scope-based and size-based ground term selection strategies for grammar construction as introduced in Section 3.1. We combine strategies based on scope with strategies based on term size into nine selection strategies: in-min, in-max, in-both, out-min, out-max, out-both, both-min, both-max, both-both. The results for our SyGuS instantiation approach with these strategies enabled are shown in Table 2. Note that preliminary experiments identified lemma selection strategy interleave as the best. Hence, we use strategy interleave as the lemma selection strategy for this experiment.

Overall, using strategy both for the scope performs best. Furthermore, for this strategy all three size-based strategies perform almost equally well. For the remaining experiments, we use strategy both-both as the term selection strategy for grammar construction, where both minimal and maximal ground terms are selected from both the quantified formula Q<sub>i</sub> (containing the variable we construct a grammar for) and the set of ground formulas G. Note that we choose the more general strategy both-both over strategy both-max even though both-max performs slightly better.

**Lemma Selection.** In our second experiment, we determine the best lemma selection strategy out of the three strategies priority-inst, priority-eval and interleave described in Section 3. The results are shown in Table 2. Note that we use the previously determined best term selection strategy both-both in this experiment.

The best overall strategy is interleave, indicating that it is beneficial to consider instantiation lemmas and evaluation unfolding lemmas in parallel. On the other hand, prioritizing evaluation lemmas over instantiation lemmas (priority-eval) performed significantly worse than the other two configurations. Since this strategy prioritizes evaluation lemmas, it has the advantage over other configurations of delaying instantiations until we obtain an interpretation I where the interpretation of e<sub>d<sub>x</sub></sub> is consistent with respect to d<sub>x</sub>, i.e., e<sub>d<sub>x</sub></sub><sup>I</sup> = **to term**(d<sub>x</sub>)<sup>I</sup>. As a consequence, prioritizing evaluation lemmas puts more effort into finding terms in instantiation that are guaranteed to refine the current candidate model I. However, we conclude from these results that it is often effective to consider instantiations in an eager fashion, either in parallel or even before considering evaluation lemmas. This is likely because instantiation lemmas may often refine the set of possible models even when G does not yet force our evaluation variables to have an interpretation that is consistent with their corresponding datatype values. Nevertheless, we found that evaluation lemmas are often necessary in practice for ensuring our procedure does not get stuck on a single model. When only instantiation lemmas are used, our procedure often terminates the loop with no new lemmas. This is to be expected, as such a strategy violates the requirements for the progress property of Theorem 1.

In the remaining experiment, we use strategy interleave as the lemma selection strategy since it performs slightly better than priority-inst.

**Comparison Against Other Techniques.** Finally, we compare our SyGuS instantiation approach against other techniques implemented in CVC4, the state-of-the-art SMT solvers Z3 [16] (version 4.8.9) and Boolector [17] (version 3.2.1), and the superposition-based theorem prover Vampire [13] (version 4.5.1). Note that Boolector implements counterexample-guided model synthesis [20] but only supports the SMT-LIB logic BV, whereas Vampire supports LIA, LRA, NIA, and NRA. We consider the following four configurations of CVC4: **ematch**: with E-matching [15] enabled; **cegqi**: with CEGQI for linear arithmetic [25] and bit-vectors [18] enabled, falls back to value-based instantiation techniques for other theories; **enum**: with enumerative instantiation [21] enabled; **syqi**: with our SyGuS instantiation approach enabled. We use strategy both-both for term selection, and interleave for lemma selection.

The results are summarized in Table 3. First, note that Z3 disagrees with the four CVC4 configurations on 10 benchmarks in logic FP. This is due to a known problem in Z3 related to operator rem, where it answers sat instead of unsat. We do not count these 10 benchmarks as solved and give the number of disagreements in parentheses, marked with a \*, in Table 3.

Overall, note that E-matching (**ematch**) performs very poorly on these benchmark sets. This is not surprising since it is designed with a focus on problems with uninterpreted functions. To a lesser extent, enumerative instantiation (**enum**) also performs poorly, probably also due to the fact that it is not designed for inputs without uninterpreted functions. In detail, both this configuration and **syqi** are enumerative in nature. The former uses a selection strategy based on the evolving ground terms in the current context, whereas the latter uses a fixed grammar built from the initial set of terms. In a sense, **syqi** leverages the power of a grammar for discovering new terms, whereas **enum** adapts to what terms are generated by instantiations. Overall, **syqi** solves 556 more benchmarks than enumerative instantiation, justifying the need for a syntax-guided approach for instantiation for inputs that are rich in background theories.

**Table 3.** SyQI vs. other techniques, Z3, Boolector, and Vampire (15,746 benchmarks).

Our results show that **syqi** is remarkably competitive when compared to **cegqi**, which uses the best known theory-specific instantiation strategies. The performance of syntax-guided instantiation matches or exceeds counterexample-guided instantiation on logics BVFP, BVFPLRA, FP, FPLRA, NIA, NRA, and UFBV. In particular, for quantified floating-point arithmetic (FP), **syqi** significantly outperforms **cegqi**, solving 224 more benchmarks. We attribute this to the fact that **cegqi** only performs value-based instantiation, whereas the use of grammars is effective in determining useful symbolic terms to use in instantiations for this theory. Interestingly, **syqi** solves the only satisfiable benchmark in the NIA category that is unsolved by **cegqi**, meaning that in a portfolio setting with all available configurations, CVC4 solves all benchmarks in this category. On the other hand, counterexample-guided instantiation outperforms **syqi** on logics such as LIA, LRA, and BV, where well-established instantiation strategies exist. Syntax-guided techniques are especially ineffective for linear real arithmetic, since it is often important to construct specific real constants based on solving sets of linear (in)equalities [25].

Comparing all configurations of CVC4 with Z3, Boolector, and Vampire, we see that in some logics like LIA and NIA, counterexample-guided instantiation in CVC4 outperforms Z3 and Vampire, whereas in other logics like NRA, UFBV, and many logics that combine BV, FP and LRA, Z3 performs best. For the logic BV, Boolector outperforms CVC4 and Z3; however, CVC4 solves the most unsatisfiable instances. The **syqi** configuration performs best on the floating-point benchmarks, where it solves 181 more than the closest competitor. When comparing the four CVC4 configurations in terms of uniquely solved instances, **cegqi** uniquely solves 660 instances, **syqi** 119 instances, **enum** 117 instances, and **ematch** not a single one. Between configurations **cegqi** and **syqi**, the former uniquely solves 1479 instances, and the latter 402 instances.

In summary, theory-specific approaches as implemented in CVC4, Z3, and Boolector outperform **syqi** in categories where instantiation strategies are highly mature, such as linear integer and real arithmetic, and fixed-width bit-vectors. Nevertheless, our evaluation demonstrates the versatility of the approach, especially for benchmarks using quantified floating-point arithmetic or combined theories where no good approach to quantifier instantiation was known.

#### **5 Conclusion**

We have presented a syntax-guided approach for quantifier instantiation and implemented it in the SMT solver CVC4. Our experiments show that our approach is a viable alternative to theory-specific quantifier instantiation techniques and can be applied to a wide range of logics. In particular, for the theory of floating-point arithmetic, syntax-guided instantiation in CVC4 significantly outperforms the state of the art. In future work, we plan to tune our grammar construction based on an analysis of which terms are more likely to appear in conflicts, which can potentially be done automatically. Another direction of future work is to provide an interface that would allow users to supply their own grammars for use in SyQI, similarly to the user-provided triggers for E-matching. We also plan to use our approach as a baseline for quantified logics in recent (and future) new theories. Currently, support in SMT solvers is highly limited, for instance, for quantified formulas involving the theory of strings and regular expressions. Syntax-guided instantiation can serve as a baseline for potential user applications that rely on quantified formulas in these theories.

#### **References**


1-7, 2015, Proceedings. Lecture Notes in Computer Science, vol. 9195, pp. 197–213. Springer (2015). https://doi.org/10.1007/978-3-319-21401-6_13


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

### **Making Theory Reasoning Simpler**

Giles Reger<sup>1</sup>, Johannes Schoisswohl<sup>1</sup> (✉), and Andrei Voronkov<sup>1,2</sup>

> <sup>1</sup> University of Manchester, Manchester, UK <sup>2</sup> EasyChair, Manchester, UK johannes.schoisswohl@manchester.ac.uk

**Abstract.** Reasoning with quantifiers and theories is at the core of many applications in program analysis and verification. Whilst the problem is undecidable in general and hard in practice, we have been making large pragmatic steps forward. Our previous work proposed an instantiation rule for theory reasoning that produced pragmatically useful instances. Whilst this led to an increase in performance, it had its limitations as the rule produces ground instances which (i) can be overly specific, thus not useful in proof search, and (ii) contribute to the already problematic search space explosion as many new instances are introduced. This paper begins by introducing a new rule that specifically addresses these two concerns as it produces general solutions and it is a simplification rule, i.e. it replaces an existing clause by a 'simpler' one. Encouraged by initial success with this new rule, we performed an experiment to identify further common cases where the complex structure of theory terms blocked existing methods. This resulted in four further simplification rules for theory reasoning. The resulting extensions are implemented in the Vampire theorem prover and evaluated on SMT-LIB, showing that the new extensions result in a considerable increase in the number of problems solved, including 90 problems unsolved by state-of-the-art SMT solvers.

#### **1 Introduction**

Many applications of reasoning in program analysis and verification depend on reasoning with the first-order theory of arithmetic, often in combination with other theories and quantifiers. A common approach to this problem is via Satisfiability Modulo Theory (SMT) solving, which has strong support for decidable theories but may struggle to scale in the presence of quantifiers. Conversely, superposition-based first-order solvers handle quantifiers naturally and have, recently, been extended to reason with theories [2,3,5,6,9,13,16,21]. Such solvers are based on a saturation loop and tend to suffer from search space explosion. This is compounded by the effective but explosive use of theory axioms, leading to the derivation of numerous inconsequential consequences of the theory. So far we have attempted to control this explosive behaviour [10,17] but now we aim to eliminate some of it. This paper introduces a set of simplification rules for reasoning in the theory of (any combination of linear or non-linear real, rational, or integer) arithmetic, i.e. rules that make reasoning in arithmetic simpler.

© The Author(s) 2021

J. F. Groote and K. G. Larsen (Eds.): TACAS 2021, LNCS 12652, pp. 164–180, 2021. https://doi.org/10.1007/978-3-030-72013-1\_9

This work was motivated by our previous attempt [20] to find useful instances of first-order clauses that would be otherwise difficult to find via reasoning with theory axioms. For example, when considering the two clauses

$$r(7x) \qquad \neg r(6+y) \lor p(y)$$

our previous work would apply resolution on r(7x) and ¬r(6 + y) using unification with abstraction to produce the clause 7x ≠ 6 + y ∨ p(y) and then apply theory instantiation, utilising an SMT solver to find the substitution {x → 1, y → 1}, producing the instance p(1). This may or may not be useful to proof search and, crucially, we need to keep performing inferences with the original clauses in case it is not. In this case, we would prefer to instantiate with {y → 7x − 6} to produce 7x ≠ 6 + (7x − 6) ∨ p(7x − 6), which can be reduced to p(7x − 6). This is a general solution (being logically equivalent) that is also simpler – in this case it has fewer variables than the original clause. Hence, we replace the clause by the more general result, aiding proof search and preventing the addition of unnecessary instances.

The above was motivated by the observation that we would often see clauses of the form kx ≠ t ∨ C[x] (for numeral k, variable x, and term t) and expend much effort using theory axioms to rewrite kx ≠ t into x ≠ t/k. This led us to conduct an experiment to identify other common cases where arithmetic clauses could be simplified. An immediate observation is that, if x ranges over the reals, p(7x − 6) can be instantiated with {x → (y + 6)/7} to produce p(y). Furthermore, in the above example we no longer need to employ the expensive unification with abstraction as we can instantiate r(7x) with {x → z/7} to produce r(z) and then resolve with ¬r(6 + y) ∨ p(y) to produce p(y) directly.
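The arithmetic behind these substitutions can be checked on sample values with exact rational arithmetic. The sketch below is only a sanity check on a finite sample, not part of the calculus; the sample set and variable names are assumptions.

```python
from fractions import Fraction

# A small sample of rational values used for the checks below.
samples = [Fraction(n, d) for n in range(-5, 6) for d in (1, 2, 3)]

# The ground instance found by theory instantiation: x = 1, y = 1 solves 7x = 6 + y.
ground_ok = 7 * 1 == 6 + 1

# General solution y -> 7x - 6: the constraint 7x = 6 + y then holds for every x,
# so the literal 7x != 6 + (7x - 6) is false and can be dropped from the clause.
general_ok = all(7 * x == 6 + (7 * x - 6) for x in samples)

# Over the reals (here: rationals), x -> (y + 6)/7 turns the argument 7x - 6 into y.
inverse_ok = all(7 * ((y + 6) / 7) - 6 == y for y in samples)

# Likewise, x -> z/7 turns the argument 7x of r into z.
scale_ok = all(7 * (z / 7) == z for z in samples)

print(ground_ok, general_ok, inverse_ok, scale_ok)
```

The last two checks rely on division, which is why the text restricts them to the case where x ranges over the reals.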

Another observation was that a large amount of effort was expended by the theorem prover reordering sums and products to expose seemingly obvious structure. For example, taking (3t + x)+2t and producing 5t + x requires three theory axioms and 12 rewriting steps. To combat this, we introduce an evaluation method that flattens sums and products, reorders and simplifies them, before reintroducing the necessary bracketed structure. A related common issue was the occurrence of terms that could easily be cancelled, such as in 4x + 3 < 4x + 10, again requiring significant rewriting effort that can be replaced by a special rule.
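A minimal sketch of flattening and cancellation is shown below. It assumes terms are nested tuples over `+` and multiplication by an integer numeral, with strings as atoms (so the term t of the example is treated as an atom), and it stops at the flattened form without rebuilding the bracketed structure. It is not Vampire's implementation, and all names are hypothetical.

```python
def linear_form(term):
    """Flatten a term into ({atom: coefficient}, constant)."""
    if isinstance(term, int):
        return {}, term
    if isinstance(term, str):
        return {term: 1}, 0
    op, left, right = term
    if op == "+":
        coeffs, const = linear_form(left)
        coeffs = dict(coeffs)
        right_coeffs, right_const = linear_form(right)
        for atom, c in right_coeffs.items():
            coeffs[atom] = coeffs.get(atom, 0) + c
        return coeffs, const + right_const
    if op == "*" and isinstance(left, int):  # numeral times term
        coeffs, const = linear_form(right)
        return {atom: left * c for atom, c in coeffs.items()}, left * const
    raise ValueError(f"unsupported term: {term!r}")


def normalise(term):
    """Collect like summands and drop those with coefficient zero."""
    coeffs, const = linear_form(term)
    return {atom: c for atom, c in sorted(coeffs.items()) if c != 0}, const


def cancel(lhs, rhs):
    """Remove summands that occur with equal coefficients on both sides."""
    (lc, lk), (rc, rk) = normalise(lhs), normalise(rhs)
    common = {atom for atom in lc if rc.get(atom) == lc[atom]}
    lc = {a: c for a, c in lc.items() if a not in common}
    rc = {a: c for a, c in rc.items() if a not in common}
    return (lc, lk), (rc, rk)


# (3t + x) + 2t normalises to 5t + x.
print(normalise(("+", ("+", ("*", 3, "t"), "x"), ("*", 2, "t"))))
# 4x + 3 < 4x + 10 reduces to 3 < 10 after cancelling 4x.
print(cancel(("+", ("*", 4, "x"), 3), ("+", ("*", 4, "x"), 10)))
```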

This paper does not present the exploratory experimentation described above but focusses instead on the fruits of this work. After introducing the necessary preliminaries (Sec. 2), we make the following contributions:


**–** A rule for cancelling subterms, e.g. in 4x + 3 < 4x + 10 (Sec. 6)

These rules are all implemented in the Vampire [1,14] theorem prover. Our experimental evaluation (Sec. 7) shows that the new rules significantly improve the number of problems (from SMT-LIB) that Vampire can solve. Our final experiment shows that the new Vampire can solve 1,052 problems unsolved by Vampire 4.5, 1,056 problems unsolved by CVC4, and 1,350 problems unsolved by Z3 — given their complementary nature, this equates to 90 problems unsolved by any of these state-of-the-art solvers.

### **2 Preliminaries and Related Work**

First-Order Logic and Theories. We consider a many-sorted first-order logic with equality. A signature is a pair Σ = (Ξ, Ω) where Ξ is a set of sorts and Ω a set of predicate and function symbols with associated argument and return sorts from Ξ. Terms are of the form c, x, or f(t<sub>1</sub>, ..., t<sub>n</sub>) where f is a function symbol of arity n ≥ 1, t<sub>1</sub>, ..., t<sub>n</sub> are terms, c is a zero arity function symbol (i.e. a constant) and x is a variable. We assume that all terms are well-sorted and write t : σ if term t has sort σ. Atoms are of the form p(t<sub>1</sub>, ..., t<sub>n</sub>), q or t<sub>1</sub> ≃<sub>s</sub> t<sub>2</sub> where p is a predicate symbol of arity n, t<sub>1</sub>, ..., t<sub>n</sub> are terms, q is a zero arity predicate symbol and, for each sort s ∈ Ξ, ≃<sub>s</sub> is the equality symbol for the sort s. We write simply ≃ when s is known from the context or irrelevant. A literal is either an atom A, in which case we call it positive, or a negation of an atom ¬A, in which case we call it negative. When L is a negative literal ¬A and we write ¬L, we mean the positive literal A. For negative literals with binary predicates ¬(t<sub>1</sub> ♦ t<sub>2</sub>) (like, e.g. equality), we sometimes write t<sub>1</sub> ♦̸ t<sub>2</sub>, for example t<sub>1</sub> ≄ t<sub>2</sub> for ¬(t<sub>1</sub> ≃ t<sub>2</sub>).

A clause is a disjunction of literals L<sub>1</sub> ∨ ... ∨ L<sub>n</sub> for n ≥ 0. We disregard the order of literals and treat a clause as a multiset. When n = 0 we speak of the empty clause, which is always false. When n = 1 a clause is called a unit clause. Variables in clauses are considered to be universally quantified. Standard methods exist to transform an arbitrary first-order formula into clausal form (e.g. [15] and our recent work in [19]).

In the following we use expression to mean a term, an atom, a literal, or a clause. We write E[t]<sub>p</sub> to denote an expression E containing a term t at position p (a position is a unique point in an expression's syntax tree) and may then write E[s]<sub>p</sub> to denote the same expression with t replaced by term s at p. We will normally leave the position p implicit. A substitution is any θ of the form {x<sub>1</sub> → t<sub>1</sub>, ..., x<sub>n</sub> → t<sub>n</sub>}, where n ≥ 0. Eθ is the expression obtained from E by the simultaneous replacement of each x<sub>i</sub> by t<sub>i</sub>. An expression is ground if it contains no variables. An instance of E is any expression Eθ and a ground instance of E is any instance of E that is ground. A unifier of two terms, atoms or literals E<sub>1</sub> and E<sub>2</sub> is a substitution θ such that E<sub>1</sub>θ = E<sub>2</sub>θ. It is known that if two expressions have a unifier, then they have a so-called most general unifier.
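The definitions of substitution, instance and unifier can be made concrete with a small sketch. The term encoding (variables as strings, f(t1, ..., tn) as a tuple headed by the symbol name) and the function names are illustrative assumptions, not Vampire's data structures.

```python
def apply_subst(term, theta):
    """Apply substitution theta (dict: variable name -> term) to a term.

    Variables are strings; f(t1, ..., tn) is the tuple ("f", t1, ..., tn),
    so a constant a is the 1-tuple ("a",).
    """
    if isinstance(term, str):              # a variable
        return theta.get(term, term)       # simultaneous: the result is not re-substituted
    head, *args = term
    return (head, *(apply_subst(arg, theta) for arg in args))


def is_unifier(theta, e1, e2):
    """Check that E1θ and E2θ are syntactically equal."""
    return apply_subst(e1, theta) == apply_subst(e2, theta)


def is_ground(term):
    """A term is ground if it contains no variables."""
    return not isinstance(term, str) and all(is_ground(a) for a in term[1:])


a, b = ("a",), ("b",)
# Simultaneous replacement swaps x and y instead of collapsing them.
print(apply_subst(("f", "x", "y"), {"x": "y", "y": "x"}))
# {x -> b, y -> a} unifies f(x, a) and f(b, y).
print(is_unifier({"x": b, "y": a}, ("f", "x", a), ("f", b, "y")))
```

The swap example is the reason the replacement has to be simultaneous: applying the two bindings one after the other would map both variables to the same one.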

We assume a standard notion of a (first-order, many-sorted) interpretation I, which assigns a non-empty domain I<sub>s</sub> to every sort s ∈ Ξ, and maps every function symbol f to a function I<sub>f</sub> and every predicate symbol p to a relation I<sub>p</sub> on these domains so that the mapping respects sorts. We call I<sub>f</sub> the interpretation of f in I, and similarly for I<sub>p</sub> and I<sub>s</sub>. Interpretations are also sometimes called first-order structures. A sentence is a closed formula, i.e. with no free variables. We use the standard notions of validity and satisfiability of sentences in such interpretations. An interpretation is a model for a set of clauses if (the universal closure of) each of these clauses is true in the interpretation.

A theory T is identified by a class of interpretations. A sentence is satisfiable in T if it is true in at least one of these interpretations and valid if it is true in all of them. A function (or predicate) symbol f is called uninterpreted in T if, for every interpretation I of T and every interpretation I′ which agrees with I on all symbols apart from f, I′ is also an interpretation of T. A theory is called complete if, for every sentence F of this theory, either F or ¬F is valid in this theory. Evidently, every theory of a single interpretation is complete. We can define satisfiability and validity of arbitrary formulas in an interpretation in a standard way by treating free variables as new uninterpreted constants.

The theories we will deal with are the theories of integer, rational, and real arithmetic with uninterpreted functions, denoted by T<sub>ℤ</sub>, T<sub>ℚ</sub>, and T<sub>ℝ</sub>, which fix the interpretation of a distinguished sort σ<sub>ℤ</sub>, σ<sub>ℚ</sub>, and σ<sub>ℝ</sub> to the set of mathematical integers ℤ, rationals ℚ, and reals ℝ respectively, and assign the usual meanings to the function and predicate symbols {+, −, <, ≤, ·}. By \$k we denote the numeral interpreted as k in any of these theories. We consider signatures over these theories to additionally contain uninterpreted functions and predicates; hence, in contrast to the case without uninterpreted functions, none of these theories has a sound and complete proof system (see e.g. [13]).

Unless stated otherwise, we use the symbols x, y, z for variables, s, t, u for terms, C, D for clauses, p, q, r for predicate symbols, f, g, h for function symbols, and σ for substitutions and sorts, sometimes with suffixes added.

Term Orderings. A simplification ordering (see, e.g. [8]) on terms is an ordering that is well-founded, monotonic, stable under substitutions and has the subterm property. Such an ordering captures a notion of simplicity, i.e. t<sub>1</sub> ≺ t<sub>2</sub> implies that t<sub>1</sub> is in some way simpler than t<sub>2</sub>. Vampire uses the Knuth-Bendix ordering [12], which is parametrized by a total precedence ordering on function and predicate symbols. This ordering is total on ground terms and partial on non-ground ones, leading to the possibility of incomparable terms, e.g. f(x, a) and f(b, y). A simplification ordering ≺ on terms can be extended to a simplification ordering on literals and clauses, using a multiset extension of orderings. For simplicity, we will use ≺ to refer to the term ordering and its lifting. Whenever E<sub>1</sub> ≺ E<sub>2</sub> (E<sub>2</sub> ≺ E<sub>1</sub>) we say that E<sub>1</sub> is smaller (bigger) than E<sub>2</sub>. An equality literal t ≃ s is oriented if t ≺ s or s ≺ t.

Saturation-Based Proof Search. We introduce our new rules within the context of saturation-based proof search. The general idea in saturation is to maintain two sets of clauses, Active and Passive. A saturation loop then selects a clause C from Passive, places C in Active, applies generating inferences between C and clauses in Active, and finally places newly derived clauses in Passive after applying some retention tests. The retention tests involve checking whether the new clause is itself redundant (i.e. a tautology) or redundant with respect to existing clauses (implied by a set of smaller clauses in Active ∪ Passive). Rules that remove the parent clause immediately from the search space without performing a retention test are called immediate simplification rules. Whenever there are applicable immediate simplification rules, the first one with respect to some fixed ordering is applied to the selected clause instead of any other rule. The rules introduced in this paper are all introduced as immediate simplification rules. However, as mentioned later, not all of them strictly obey the requirement that the result is smaller. Normally this would have implications for the completeness of the approach, but completeness is already lost once we start reasoning with theories. This leads us to a trade-off between the potential loss of some proofs by missing some inferences, and the potential gain via simplifying proof search. Our later experimental results show that forgoing completeness is of pragmatic interest.
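The loop described above can be illustrated with a deliberately small sketch. It is a propositional toy, not Vampire's implementation: the only generating inference is binary resolution, the retention tests are tautology and duplicate deletion, and clause selection simply prefers short clauses. All names are assumptions made for illustration. Immediate simplification rules, which this sketch omits, would be applied to the selected clause before any generating inference.

```python
def resolvents(c, d):
    """All binary resolvents of two propositional clauses (frozensets of ints)."""
    return [(c - {lit}) | (d - {-lit}) for lit in c if -lit in d]


def is_tautology(clause):
    return any(-lit in clause for lit in clause)


def saturate(clauses, max_steps=10_000):
    """Given-clause loop; True means the empty clause was derived."""
    passive = {frozenset(c) for c in clauses}
    passive = {c for c in passive if not is_tautology(c)}
    active = set()
    for _ in range(max_steps):
        if not passive:
            return False                       # saturated without a refutation
        given = min(passive, key=lambda c: (len(c), sorted(c)))  # clause selection
        passive.remove(given)
        if not given:
            return True                        # empty clause: input is unsatisfiable
        active.add(given)
        for partner in list(active):           # generating inferences with Active
            for new in resolvents(given, partner):
                # retention tests: drop tautologies and clauses already present
                if not is_tautology(new) and new not in active and new not in passive:
                    passive.add(new)
    raise RuntimeError("step limit reached")


print(saturate([[1, 2], [-1, 2], [-2]]))   # p or q, not p or q, not q
print(saturate([[1, 2], [-1, 2]]))         # satisfiable: the loop saturates
```

Literals are encoded as non-zero integers with negation as sign change, so a clause is a set of integers and the empty set is the empty clause.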

Superposition Calculus. Vampire works with the superposition and resolution calculus (see our previous work [11,14] for a description). The calculus itself is not of direct interest to this work. We do, however, draw attention to two rules. Firstly, the Equality Resolution rule

$$\frac{s \not\simeq t \lor C}{C\theta} \quad \theta \text{ is a most general unifier of } s \text{ and } t$$

is a starting point for both our previous theory instantiation work and the Gaussian Variable Elimination rule introduced later (Sec. 3). Secondly, we draw attention to the Demodulation (or rewriting by unit equalities) rule

$$\frac{l \simeq r \quad L[t] \lor C}{L[r\theta] \lor C}$$

where lθ = t, rθ ≺ lθ, and (l ≃ r)θ ≺ L[t] ∨ C. This is of interest as later we will need to take special care of the last side-condition when evaluating terms.

Theory Reasoning. To perform theory reasoning within this context it is common to do two things. Firstly, new clauses are evaluated to put them in a common form (e.g. all inequalities are rewritten in terms of <) and ground theory terms and literals are evaluated (e.g. 1 + 2 becomes 3 and 1 < 2 becomes false). More complex evaluation is possible and is the subject of this work (see Section 5). Secondly, relevant theory axioms can be added to the initial search space. For example, if the input clauses use the + symbol one can add the axioms x + y ≃ y + x and x + 0 ≃ x, among others.

In addition to these basic methods, Vampire also employs a number of other techniques. AVATAR modulo theories [16] uses an SMT solver within the context of clause splitting to ensure that the ground part of any chosen clause split is theory-consistent. The previously mentioned unification with abstraction and theory instantiation [20] rules support lazy unification modulo theories and pragmatic instantiation. Theory axiom usage can be controlled by the set of support strategy [17] or layered clause selection [10]. Both approaches de-prioritise reasoning with theory axioms.

#### **3 Gaussian variable elimination**

Recall the example 7x ≄ 6 + y ∨ p(y) from the Introduction (Sec. 1) where we want to identify the substitution {y → 7x − 6} to produce the simpler instance p(7x − 6). Our general approach is to rewrite 7x ≄ 6 + y in terms of y and then apply the standard Equality Resolution rule introduced in Sec. 2. This gives us the straightforward rule:

$$\frac{s \not\simeq t \lor C[x]}{C[u]} \text{ gve}$$

where x : σ<sub>ℤ</sub>, x : σ<sub>ℚ</sub>, or x : σ<sub>ℝ</sub>; either ⟨s, t⟩ ⟹<sup>∗</sup><sub>gve</sub> ⟨x, u⟩ or ⟨t, s⟩ ⟹<sup>∗</sup><sub>gve</sub> ⟨x, u⟩; and x is not a subterm of u. The relation ⟹<sup>∗</sup><sub>gve</sub> is the reflexive and transitive closure of the relation ⟹<sub>gve</sub>, which is defined as follows.

$$\begin{array}{ll} \langle s+t,u\rangle \Longrightarrow\_{\mathsf{gve}} \langle s,u+(-t)\rangle & \\ \langle s+t,u\rangle \Longrightarrow\_{\mathsf{gve}} \langle t,u+(-s)\rangle & \\ \langle -s,t\rangle \Longrightarrow\_{\mathsf{gve}} \langle s,-t\rangle & \\ \langle s\cdot \hat{t},u\rangle \Longrightarrow\_{\mathsf{gve}} \langle s,u/\hat{t}\rangle & \text{if } t \neq 0 \text{, and } \hat{t}:\sigma\_{\mathbb{Q}} \text{ or } \hat{t}:\sigma\_{\mathbb{R}} \\ \langle \hat{s}\cdot t,u\rangle \Longrightarrow\_{\mathsf{gve}} \langle t,u/\hat{s}\rangle & \text{if } s \neq 0 \text{, and } \hat{s}:\sigma\_{\mathbb{Q}} \text{ or } \hat{s}:\sigma\_{\mathbb{R}} \end{array}$$

It should be noted that ⟹<sub>gve</sub> is not normalising. The pair ⟨s<sub>1</sub> + s<sub>2</sub>, t⟩ can, for example, be rewritten to ⟨s<sub>1</sub>, t − s<sub>2</sub>⟩ as well as to ⟨s<sub>2</sub>, t − s<sub>1</sub>⟩. But since there is at most a linear number of such rewritings, we can enumerate all of them and choose the first ⟨x, t⟩ such that x is not a subterm of t. Further choice comes from the fact that we can rewrite based on either ⟨l, r⟩ or ⟨r, l⟩. Looking at our example, we could rewrite

$$\langle 6+y, 7x \rangle \implies\_{\mathsf{gve}} \langle y, 7x - 6 \rangle$$

but also

$$\langle 7x, 6+y \rangle \implies\_{\mathsf{gve}} \langle x, (6+y)/7 \rangle$$

if x is not of integer sort, leaving us with a choice. Another source of choice comes from the fact that our premise can contain multiple negative equalities. Any of those could potentially be used to rewrite the rest of the clause.

Since application of the rule will yield a logically equivalent conclusion, with fewer literals and fewer distinct variables, we make an arbitrary choice. For the same reason, we implement this as a simplification rule (thus removing the premise from the search space) even though the conclusion will often be incomparable to (not smaller than) the premise.
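The enumeration of rewritings can be sketched as a breadth-first search over pairs. This is a minimal illustration under assumptions: terms are nested tuples, variables are strings, numerals are Python numbers, and a single `integer_sort` flag stands in for the sort conditions on the division steps. None of the names below come from Vampire.

```python
from fractions import Fraction


def is_var(t):
    return isinstance(t, str)


def is_num(t):
    return isinstance(t, (int, Fraction))


def occurs(x, t):
    """True if variable x is a subterm of t."""
    if is_var(t):
        return t == x
    if is_num(t):
        return False
    return any(occurs(x, arg) for arg in t[1:])


def gve_steps(lhs, rhs, integer_sort=False):
    """Yield every pair reachable from <lhs, rhs> in one rewriting step."""
    if not isinstance(lhs, tuple):
        return
    op = lhs[0]
    if op == "+":
        _, s, t = lhs
        yield s, ("+", rhs, ("-", t))
        yield t, ("+", rhs, ("-", s))
    elif op == "-":                       # unary minus
        yield lhs[1], ("-", rhs)
    elif op == "*" and not integer_sort:  # division is not available over integers
        _, s, t = lhs
        if is_num(s) and s != 0:
            yield t, ("/", rhs, s)
        if is_num(t) and t != 0:
            yield s, ("/", rhs, t)


def isolate(lhs, rhs, integer_sort=False):
    """Return the first pair <x, u> with x a variable not occurring in u, else None."""
    todo = [(lhs, rhs), (rhs, lhs)]       # both orientations of the equation
    while todo:
        l, r = todo.pop(0)
        if is_var(l) and not occurs(l, r):
            return l, r
        todo.extend(gve_steps(l, r, integer_sort))
    return None


print(isolate(("+", 6, "y"), ("*", 7, "x")))   # binds y to 7x + (-6)
print(isolate(("*", 7, "x"), 3))               # binds x to 3 / 7
```

Every step strictly shrinks the left component of the pair, so the search terminates. Which binding is returned depends on the enumeration order, mirroring the arbitrary choice described above.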

$$\frac{p(7xxxy - 6)}{p(7xxx - 6)} \text{ asg}^{\cdot}\_{\text{var}} \qquad \frac{p(7xxx - 6)}{p(7x - 6)} \text{ (variable power)} \qquad \frac{p(7x - 6)}{p(x - 6)} \text{ asg}^{\cdot}\_{\text{num}} \qquad \frac{p(x - 6)}{p(x)} \text{ asg}^{+}$$

**Figure 1.** Illustration of the 4 generalization rules, in the theory of Reals.

To further demonstrate this rule we consider the additional example

$$\begin{array}{cl} x+y \neq 36 \lor x+3y \neq 90 \lor p(x,y) & \\ \hline (36-y)+3y \neq 90 \lor p(36-y,y) & \text{(gve)} \\ \hline 36+2y \neq 90 \lor p(36-y,y) & \text{(eval)} \\ \hline p(36-(90-36)/2,\ (90-36)/2) & \text{(gve)} \end{array}$$

which highlights the need to interleave evaluation between successive Gaussian elimination steps — we discuss our evaluation strategy below.
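As a sanity check on this example, the bindings produced by the two elimination steps can be computed directly. The snippet is only an arithmetic illustration with exact rationals; it does not model the rule itself.

```python
from fractions import Fraction

# First gve step: x + y != 36 is solved for x, giving the binding x -> 36 - y.
# Substituting into x + 3y != 90 gives (36 - y) + 3y != 90, which evaluation
# turns into 36 + 2y != 90. The second gve step then solves for y.
y = Fraction(90 - 36, 2)
x = 36 - y

# Both negated equalities are falsified by these values, so only p(x, y) remains.
assert x + y == 36 and x + 3 * y == 90
print(f"remaining clause: p({x}, {y})")
```

Without the evaluation step in between, the second step would have to solve (36 − y) + 3y for y, where y occurs on both sides after isolation and the side condition of the rule fails.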

#### **4 Arithmetic subterm generalization**

Taking a closer look at the choice for our example from the previous section, we see that we could have instantiated the premise y + 6 ≄ 7x ∨ p(y) either with {y → 7x − 6} to get p(7x − 6), or with {x → (6 + y) / 7} to obtain p(y) (again, assuming that x is not of integer sort). Both clauses are logically equivalent in T<sub>ℚ</sub> and T<sub>ℝ</sub>: the former is an instance of the latter, and the former implies the latter, since applying the substitution {x → (y + 6) / 7} to the former and simplifying yields the latter. Obviously this kind of reasoning can be applied to any linear subterm \$k · x + d where k ≠ 0.

Splitting this idea into multiple rules lets us take these generalizations further. We therefore propose 4 rules for arithmetic subterm generalization, which are illustrated in a single example in Figure 1.

Since we do not want the applicability of our generalization rules to depend on associativity and commutativity (AC), we formulate them modulo AC. For this purpose we introduce the following notation. We use C[t]<sub>AC</sub> to denote a clause that contains the subterm t modulo AC. Further, we use C[t′]<sub>AC</sub> to denote the same clause with all occurrences of t modulo AC replaced by t′.

Addition Generalization

$$\frac{C[x+t\_1+\ldots+t\_n]\_{AC}}{C[x]\_{AC}}\text{ asg}^+$$

where

$$-\ x: \sigma \text{ for some } \sigma \in \{\sigma\_{\mathbb{Z}}, \sigma\_{\mathbb{Q}}, \sigma\_{\mathbb{R}}\}$$


The first rule deals with the case where a clause contains a sum with a variable as summand. Such a sum can be generalized by applying the substitution {x → x − t<sub>1</sub> − ... − t<sub>n</sub>} and simplifying the result.

Numeral Multiplication Generalization

$$\frac{C[k \cdot x \cdot t\_1 \cdot \ldots \cdot t\_n]\_{AC}}{C[x \cdot t\_1 \cdot \ldots \cdot t\_n]\_{AC}} \text{ asg}\_{\text{num}}^{\cdot}$$

where


In the second rule we generalize a product that contains a variable occurring only once in this product. Its soundness is justified by the substitution {x → x / \$k}.

Variable Multiplication Generalization

$$\frac{C[x \cdot x\_1 \cdot \ldots \cdot x\_n]\_{AC}}{C[x]\_{AC}} \text{ asg}^{\cdot}\_{\text{var}}$$

where


**–** x ≠ x<sub>i</sub>

In this rule we generalize subterms that are products of variables, containing redundant variables. The rule is sound since we can replace x<sub>i</sub> by \$1.

Variable Power Generalization

$$\frac{C[x^n]\_{AC}}{C[x^k]\_{AC}} \text{ (variable power generalization)}$$

where

$$\begin{cases} - & x: \sigma\_{\mathbb{R}} \\ - & x^n \text{ is an abbreviation for } x \cdot x \cdot \ldots \cdot x \\ - & k = \begin{cases} 1 & \text{if } n \text{ is odd} \\ 2 & \text{if } n \text{ is even} \end{cases} \\ - & \text{all occurrences of } x \text{ are in the term } x^n \text{ (modulo AC)} \end{cases}$$

The last rule lets us generalize away redundant powers of variables. Its soundness is guaranteed by the fact that, over the real numbers, x<sup>n</sup> and x<sup>k</sup> range over the same set of values.
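The choice of k can be illustrated numerically. The sketch below checks one direction of the claim on exact integer samples: every value of x<sup>n</sup> is also a value of y<sup>k</sup>. The converse direction needs real roots and is not exercised here. The function names are hypothetical.

```python
def generalized_exponent(n):
    """Exponent k kept by the rule: 1 for odd n, 2 for even n."""
    return 1 if n % 2 == 1 else 2


def witness(x, n):
    """Return y such that y**k == x**n for k = generalized_exponent(n)."""
    if n % 2 == 1:
        return x ** n                 # k = 1: take y = x^n itself
    return abs(x) ** (n // 2)         # k = 2: y = |x|^(n/2), so y^2 = x^n


for x in (-3, -1, 0, 2):
    for n in (1, 2, 3, 4, 6):
        k = generalized_exponent(n)
        assert witness(x, n) ** k == x ** n
```

For even n the sign of x is lost, which is why x<sup>2</sup> rather than x is the most general replacement.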

All of the above rules produce a result that is smaller with respect to any simplification ordering due to the removal of terms, justifying their implementation as immediate simplifications.

#### **5 Evaluation**

As mentioned above, reasoning with arithmetic often requires us to be able to evaluate terms. Evaluations such as 3 + 3 ⟹ 6 and f(x) + 0 ⟹ f(x) are straightforward, but we also want to support evaluations such as (3t + x) + 2t ⟹ 5t + x for variable x and arbitrary term t. We introduce a new method for this (replacing a previous ad-hoc method implemented in Vampire). The general idea is to first rewrite terms into a special normal form, apply simplifying steps that preserve this form, and then denormalise to obtain standard terms again. We describe the three steps in detail below.

Normalization. This step removes the need to take care of reordering and bracketing of terms. Our general normal form is as follows

$$
\hat{c}\_1 \cdot (t\_{1,1} \cdot \ldots \cdot t\_{1,k\_1}) + \ldots + \hat{c}\_n \cdot (t\_{n,1} \cdot \ldots \cdot t\_{n,k\_n})
$$

where t<sub>i,j</sub> ≺<sub>1</sub> t<sub>i,j+1</sub> and (t<sub>i,1</sub> · ... · t<sub>i,k<sub>i</sub></sub>) ≺<sub>2</sub> (t<sub>i+1,1</sub> · ... · t<sub>i+1,k<sub>i+1</sub></sub>). To get to this normal form we rewrite −t as −1 · t, rewrite t / \$c as t · \$(1/c), rewrite t as 1 · t where necessary, and sort with respect to ≺<sub>1</sub> and ≺<sub>2</sub>. Both relations need to be strict total orderings, ≺<sub>1</sub> on terms and ≺<sub>2</sub> on ≺<sub>1</sub>-sorted lists of terms. Vampire uses so-called aggressive sharing for terms, meaning that for each distinct term there is at most one instance present in memory, and copies are made by copying the term's id. Hence we can define ≺<sub>1</sub> as comparing the ids of two terms. We use the same approach for ≺<sub>2</sub>.

Simplification. Once in normal form, terms can be simplified by joining coefficients for identical terms and removing terms multiplied by zero. This can be given as follows:

$$
\widehat{c} \cdot t \cdot \ldots \cdot \widehat{d} \cdot \ldots \cdot u \Longrightarrow\_{\text{eval}} \widehat{c \cdot d} \cdot t \cdot \ldots \cdot u
$$

$$
s + \ldots \widehat{c}\_1 \cdot t + \widehat{c}\_2 \cdot t \ldots + u \Longrightarrow\_{\text{eval}} s + \ldots \widehat{c\_1 + c\_2} \cdot t \ldots + u
$$

$$
s + \ldots + \widehat{0} \cdot t + \ldots + u \Longrightarrow\_{\text{eval}} s + \ldots + u
$$

If removing a summand would leave an empty sum, we simplify to \$0 instead. All of these steps can be implemented in linear time and in a bottom-up manner, since we can rely, firstly, on the terms being sorted by the non-numeral parts of their summands and, secondly, on the numeral part of a product being in a fixed position.
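The normalisation and simplification steps can be sketched as follows. This is an illustration under assumptions, not Vampire's code: terms are nested tuples, sorting by `repr` stands in for the id-based orderings ≺<sub>1</sub> and ≺<sub>2</sub>, a dictionary replaces the linear merge over sorted summands, and non-numeral factors are treated as opaque.

```python
from collections import defaultdict
from fractions import Fraction


def flatten(term, op):
    """Flatten nested applications of the operator `op` into a list of arguments."""
    if isinstance(term, tuple) and term[0] == op:
        return [leaf for arg in term[1:] for leaf in flatten(arg, op)]
    return [term]


def to_monomial(term):
    """Split a product into (numeral coefficient, sorted tuple of other factors)."""
    coeff, factors = Fraction(1), []
    for f in flatten(term, "*"):
        if isinstance(f, (int, Fraction)):
            coeff *= f                              # join numeral factors
        elif isinstance(f, tuple) and f[0] == "-":  # -t is treated as -1 * t
            c, fs = to_monomial(f[1])
            coeff, factors = coeff * -c, factors + list(fs)
        else:
            factors.append(f)
    return coeff, tuple(sorted(factors, key=repr))


def evaluate(term):
    """Return the simplified sum as a list of (coefficient, factors) pairs."""
    sums = defaultdict(Fraction)
    for summand in flatten(term, "+"):
        coeff, factors = to_monomial(summand)
        sums[factors] += coeff                      # join coefficients of identical terms
    result = [(c, fs) for fs, c in sorted(sums.items(), key=repr) if c != 0]
    return result or [(Fraction(0), ())]            # an empty sum becomes the numeral 0


# (3t + x) + 2t simplifies to 5t + x
print(evaluate(("+", ("+", ("*", 3, "t"), "x"), ("*", 2, "t"))))
```

The output stays in normal form. Denormalisation, which drops coefficients of 1 and restores bracketing, would be applied to the resulting list.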

Denormalisation. Finally, as the normal form contains redundant information (such as 1 · t + ... instead of t + ...) we need to denormalise as follows:

$$\begin{aligned} -1 \cdot (t\_1 \cdot \ldots \cdot t\_n) & \Longrightarrow (t\_1 \cdot (\ldots \cdot (t\_{n-1} \cdot (-t\_n)) \ldots )) \\ 1 \cdot (t\_1 \cdot \ldots \cdot t\_n) & \Longrightarrow (t\_1 \cdot (\ldots \cdot (t\_{n-1} \cdot t\_n) \ldots )) \end{aligned}$$

We define the rule eval to be the chain of normalising, simplifying and denormalising a clause in a bottom-up manner, which is only applied if the step of simplification is successful for some subterm. The reason for not always applying the rules is to prevent arbitrary reordering of sums and products, which in many cases leads to conclusions being bigger than the premise. This can have significant consequences beyond perturbing proof search. Consider the following scenario involving the Demodulation rule (see Sec. 2).

$$\begin{array}{cl} x+y \simeq y+x \qquad k = a+(b+c) & \\ \hline k = a+(c+b) & \text{(demodulation)} \\ \hline k = a+(b+c) & \text{(eval)} \end{array}$$

This process would repeat itself ad infinitum as the initial clause is deleted and replaced by an identical clause. Evaluation would violate the side-condition that should have prevented this if we did not insist on the simplification step being successful for the rule to be applied.

In most cases this inference rule is a true simplification with respect to our simplification ordering, since we eliminate at least one symbol in each of the cases of the simplification step. Because normalisation sometimes generates bigger terms, as in the case x + x ⟹ 1 · x + 1 · x ⟹ 2 · x, we sometimes violate the simplification ordering. Since these cases do not occur too frequently, and completeness is not possible in our base theories, we ignore these violations.

During experimentation, we discovered many cases where a unary minus blocks our evaluation rule. Consider the following desired derivation

$$\begin{array}{c} y+t \neq x \lor C[y+-x] \\ \hline C[y+-(y+t)] \\ \hline C[y+(-y+-t)] \\ \hline C[-t] \end{array}$$

This is not currently possible as the weight of −y + −t is 5, which is larger than the weight of −(y + t), meaning the second step is not a simplification.
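The weight comparison can be reproduced with a symbol-counting function. The sketch assumes that every variable and function symbol has weight 1 and that t is a single symbol. That assumption is consistent with the stated weight of 5, but it is not taken from Vampire's configuration.

```python
def weight(term):
    """Symbol count: every variable and function symbol weighs 1 (an assumption)."""
    if isinstance(term, str):
        return 1
    return 1 + sum(weight(arg) for arg in term[1:])


pushed_in = ("+", ("-", "y"), ("-", "t"))    # (-y) + (-t)
pushed_out = ("-", ("+", "y", "t"))          # -(y + t)
print(weight(pushed_in), weight(pushed_out))
```

Under this weight function, pushing the minus inwards adds one symbol, so the rewritten term cannot be smaller in a Knuth-Bendix ordering that compares weights first.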

We introduce a simple fix by modifying the weight function and symbol precedence of the Knuth-Bendix ordering as follows:


As a result we can use the following rewrite rule as an additional simplification rule, since the right-hand side has the same weight as the left-hand side, but −, the outermost symbol on the left-hand side, has higher precedence than +, the outermost symbol on the right-hand side.

$$-\left(x+y\right) \Longrightarrow\_{\text{push}-}\left(-x\right)+\left(-y\right)$$

#### **6 Cancellation**

The motivation for our last rule was two-fold. Firstly, evaluation of constant predicates can be helpful, but fails in seemingly trivial cases. One example is the redundant literal 4x + 3 < 4x + 10. The simple approach of evaluating interpreted predicates fails since the literal is not ground. However, it can be simplified to a ground literal, which can then be evaluated, by cancelling the 4x on both sides of the inequality.

The second motivation came from cases where unification with abstraction yields literals to which gve could almost be applied, but which first require a step of cancellation. An example of such a case is the derivation

$$\begin{array}{cl} p(5x) \qquad \neg p(3x) \lor C[x] & \\ \hline 3x \neq 5x \lor C[x] & \text{(unification with abstraction)} \\ \hline 0 \neq 2x \lor C[x] & \text{(cancel)} \\ \hline C[0] & \text{(gve)} \end{array}$$

In order to resolve both of these cases we propose the cancellation inference rule, cancel, which consists of the following two symmetric cases depending on which side is cancelled.

$$\frac{s+\ldots\hat{n}t\ldots+u \diamond v+\ldots\hat{n}t\ldots+w \lor C}{s+\ldots+u \diamond v+\ldots+w \lor C} \text{ cancel}$$

where

$$-\ \diamond \in \{\simeq, \not\simeq, <, \leq, \not<, \not\leq\}$$

$$\frac{s+\ldots\hat{n}t\ldots+u \diamond v+\ldots\hat{m}t\ldots+w \lor C}{s+\ldots+u \diamond v+\ldots\widehat{m-n}t\ldots+w \lor C} \text{ cancel}$$

where

$$\begin{array}{l} -\ \widehat{m-n} \ll \widehat{n-m} \\ -\ \diamond \in \{\simeq, \not\simeq, <, \leq, \not<, \not\leq\} \end{array}$$

$$\frac{s+\ldots\hat{n}t\ldots+u \diamond v+\ldots\hat{m}t\ldots+w \lor C}{s+\ldots\widehat{n-m}t\ldots+u \diamond v+\ldots+w \lor C} \text{ cancel}$$

where

$$\begin{array}{l} -\ \widehat{n-m} \ll \widehat{m-n} \\ -\ \diamond \in \{\simeq, \not\simeq, <, \leq, \not<, \not\leq\} \end{array}$$

In order for the rule to not be sensitive to associativity and commutativity, we perform the same steps of normalisation and denormalisation as for the rule eval. Again we will only simplify a clause, if cancellation itself, not only normalisation and denormalisation, is applicable.

The rule is a simplification rule since the number of symbols is reduced with (almost) every application of the cancellation.
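A minimal sketch of cancellation on normalised sums is given below. Each side of the literal is represented as a dictionary from non-numeral terms to numeral coefficients. The remainder is kept on the side where it is positive, which is an assumption standing in for the comparison of numerals in the side conditions. The names and the representation are illustrative only.

```python
CONST = ()   # key used for the numeral summand of a side (a representation choice)


def cancel(lhs, rhs):
    """Cancel non-numeral summands shared by both sides of `lhs <> rhs`.

    Each side is a dict mapping a term t to its numeral coefficient n.
    """
    new_lhs, new_rhs = dict(lhs), dict(rhs)
    for t in (lhs.keys() & rhs.keys()) - {CONST}:
        n, m = new_lhs.pop(t), new_rhs.pop(t)
        if m > n:
            new_rhs[t] = m - n        # keep the remainder on the right
        elif n > m:
            new_lhs[t] = n - m        # keep the remainder on the left
    return new_lhs, new_rhs


# 4x + 3 < 4x + 10 becomes the ground literal 3 < 10, which can be evaluated.
l, r = cancel({"x": 4, CONST: 3}, {"x": 4, CONST: 10})
print(l, r, l[CONST] < r[CONST])

# 3x != 5x becomes 0 != 2x, to which gve is applicable.
print(cancel({"x": 3}, {"x": 5}))
```

Subtracting the same summand from both sides preserves the truth of every predicate in the list above, which is why a single function covers equalities and inequalities alike.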

**Table 1.** Comparison of the number of problems solved by any configuration in which a new option is enabled against those in which it is disabled, with a runtime of 10 seconds. The column "both" lists how many problems were solved in both cases. The columns "on" and "off" list how many additional problems could be solved with the option enabled or disabled, respectively.


#### **7 Experimental evaluation**

We describe two experiments to establish the impact of the new rules. The first experiment compares the new rules to each other, whilst the second experiment aims to determine how helpful the new rules will be in designing extensions to Vampire's portfolio mode. This is a standard approach to evaluating the benefit of new features in an automated theorem prover [18].

Experimental Setup. We implemented the rules as immediate simplification rules in Vampire 4.5 (the implementation is available from the GitHub repository linked from the Vampire website [1], on the branch integer-arithemtic). We selected a suitable subset of problems as follows. We started with the set of 56,210 problems from SMT-LIB that involve quantifiers and arithmetic. In a first step we filtered out benchmarks that Vampire could solve within 1 second in **both** default mode (which involves a simpler version of the rule eval), and in default mode with eval enabled. Our main experiments were carried out on the remaining set of 21,512 benchmarks, which we will refer to as **B**. Filtering out trivial benchmarks avoids the results containing noise from benchmarks that can easily be solved and is an approach recently adopted by SMT-COMP [22]. Experiments are run on a Linux cluster where each node contains two octa-core 2.1 GHz Intel Xeon processors and 160GB of RAM. The raw results of our experiments can be found on GitHub<sup>3</sup>.

Experiment 1. In our first experiment we wanted to find out which are the best combinations of new rules, and whether the rules themselves have a positive impact on proof search. Therefore we ran Vampire in each of the 32 configurations **C** resulting from enabling or disabling each of the 5 groups of rules (asg, gve, eval, push−, and cancel) over **B** with a timeout of 10 seconds.

The results are given in Table 1, showing the total number of problems solved and the problems gained/lost when compared to the default mode with no options set. Each row represents the combination (union) of 16 strategies where that option is turned on. This shows that, with the exception of evaluation, the gains outweigh the losses, sometimes considerably. This result for evaluation tells us that the other rules can still operate effectively without our new evaluation and, further, that the two evaluation methods are in some sense complementary. Therefore, whilst we explore this further, we will keep both evaluation methods. The most significant gains are with cancellation, which may be related to the fact that it is applicable to inequalities as well as equalities.

<sup>3</sup> https://github.com/vprover/vampire\_publications/tree/master/experimental\_data/TACAS-2021-THEORY-REASONING

**Table 2.** The top 10 strategies in the greedy ranking of configurations.

**Table 3.** The symmetric difference in number of problems solved between the three new strategies in portfolio mode and Vampire 4.5. Each cell indicates the number of problems solved by the row solver and unsolved by the column solver. The column unique lists how many problems each strategy could solve that no other strategy could. The strategy Vampire<sup>∗</sup> is what we can solve with any of the three other strategies. Vampire<sup>∗</sup> is not taken into account for uniqueness.

Greedy Ranking. Another way of looking at the results of Experiment 1 is to create a greedy ranking of all configurations **C**: starting with the set of all configurations, we rank the configuration solving the most benchmarks in **B** as the best, the one that solves most of the remaining benchmarks as second, and so on. The top 10 strategies in this ranking are given in Table 2. The overall best strategy uses all 5 of the new rules. Interestingly, the second best strategy only uses the gve rule. This ranking indicates the most promising strategies to use in our next experiment.
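The ranking procedure amounts to a greedy cover of the benchmark set and can be sketched in a few lines. The configuration names and benchmark ids in the example are made up for illustration and are not the experimental data.

```python
def greedy_ranking(solved):
    """solved maps a configuration name to the set of benchmarks it solves."""
    remaining = dict(solved)
    covered, ranking = set(), []
    while remaining:
        # configuration that solves the most benchmarks not yet covered
        best = max(sorted(remaining), key=lambda c: len(remaining[c] - covered))
        gain = remaining.pop(best) - covered
        if not gain:
            break                      # no configuration adds new benchmarks
        ranking.append((best, len(gain)))
        covered |= gain
    return ranking


# Hypothetical results for three configurations over benchmarks 1..6.
example = {"all": {1, 2, 3, 4}, "gve": {3, 5, 6}, "none": {1, 2}}
print(greedy_ranking(example))
```

A configuration that solves many benchmarks overall can still rank low if most of them are already covered by higher-ranked configurations, which is how a configuration using only gve can come second.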

Experiment 2. In our second experiment we wanted to see how many new problems we can solve with the new simplification rules compared to our current best effort in Vampire 4.5. Therefore we ran Vampire with the three top-ranking configurations of Experiment 1 forcibly added on top of Vampire's portfolio mode. The portfolio mode executes a sequence of strategies heuristically chosen based on problem features. Forcing a configuration of new options on top of this forces each strategy to make use of the new options. We ran this experiment over **B** with a timeout of 200 seconds.

**Table 4.** Comparing our new approach, Vampire<sup>∗</sup>, against Vampire 4.5, CVC4, and Z3 with results separated by logic. The notation (+a, −b) means that the solver solved a problems that the new Vampire could not solve, and the new Vampire solved b problems that the other solver could not. The entries a(b) in the column Vampire<sup>∗</sup> list the number a of problems that could be solved by our new rules, and the number b of these problems that could not be solved by any of the other solvers.

Results are given in Table 3 and show that the new rules allow Vampire to solve considerably more problems (1052) than it could before whilst losing relatively few (47). The best configuration of options (all five new rules) solves the most, with the other two configurations solving roughly the same number. The interesting point here is that they remain complementary, solving a large number of problems uniquely. These are the exact conditions we require for producing a new, powerful portfolio mode. It is likely that performance will improve even further when also considering other option combinations.

Finally, Table 4 compares the number of problems solved by any of the three top strategies, referred to as Vampire<sup>∗</sup>, against Vampire 4.5, Z3 [7] and CVC4 [4]. Results are further separated by the logic to which the benchmarks belong: A stands for Arrays, UF stands for Uninterpreted Functions, DT stands for Data Types, L stands for Linear, N for Non-linear, I stands for Integers, R stands for Reals, with the final A standing for Arithmetic. Here we notice that the new rules make a considerable impact in the case of pure linear real arithmetic. This is likely due to the fact that asg allows us to fully generalise away most linear terms and gve will be broadly applicable without uninterpreted functions. It is interesting to note that, whilst the new Vampire solves fewer problems than CVC4 and Z3 overall, it solves many problems (1056 and 1350, respectively) that the other provers do not solve. The most striking result is that we can solve 90 new problems that neither Vampire 4.5 nor either of the state-of-the-art SMT solvers could solve.

#### **8 Conclusion**

We have motivated and introduced five new simplification rules for reasoning in the theory of arithmetic within saturation-based first-order theorem provers. These rules were implemented within the Vampire theorem prover and demonstrated to improve the reasoning power on problems taken from SMT-LIB. It remains future work to explore the ideal combinations of these rules and existing proof search heuristics. It also remains an open question whether we can design an evaluation rule and modified simplification ordering that ensures that every evaluation that we want to perform is a true simplification. As demonstrated, this is not necessary pragmatically but would be satisfying theoretically.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

#### **Deductive Stability Proofs for Ordinary Differential Equations**

Yong Kiam Tan and André Platzer

Computer Science Department, Carnegie Mellon University, Pittsburgh, USA {yongkiat,aplatzer}@cs.cmu.edu

**Abstract.** Stability is required for real world controlled systems as it ensures that those systems can tolerate small, real world perturbations around their desired operating states. This paper shows how stability for continuous systems modeled by ordinary differential equations (ODEs) can be formally verified in differential dynamic logic (dL). The key insight is to specify ODE stability by suitably nesting the dynamic modalities of dL with first-order logic quantifiers. Elucidating the logical structure of stability properties in this way has three key benefits: i) it provides a flexible means of formally specifying various stability properties of interest, ii) it yields rigorous proofs of those stability properties from dL's axioms with dL's ODE safety and liveness proof principles, and iii) it enables formal analysis of the relationships between various stability properties which, in turn, inform proofs of those properties. These benefits are put into practice through an implementation of stability proofs for several examples in KeYmaera X, a hybrid systems theorem prover based on dL.

**Keywords:** differential equations, stability, differential dynamic logic

#### **1 Introduction**

The study of stability has its roots in efforts to understand mechanical systems, particularly those arising in celestial mechanics [15,19,30]. Today, it is an important part of numerous applications in dynamical systems [34] and control theory [14,18]. This paper studies proofs of stability for continuous dynamical systems described by ordinary differential equations (ODEs), such as those used to model feedback control systems [14,18]. For such systems, ODE stability is a key correctness requirement [2] that deserves fully rigorous proofs alongside other key properties such as safety and liveness of those ODEs [28,36]. Despite this, formal stability verification has received less attention compared to proofs of safety and liveness, e.g., through reachability or deductive techniques [8].

Stability for a continuous system (or ODEs) requires that i) its system state always stays close to some desired operating state(s) when initially slightly perturbed from those operating state(s), and ii) those perturbations are eventually dissipated so the system returns to a desired operating state. These properties are especially crucial for engineered systems because they must be robust to real world perturbations deviating from idealized system models. Simple pendulums provide canonical examples of stability phenomena: they are always observed to settle in the rest position of Fig. 1 (bottom) after some time regardless of how they are initially released. In contrast, the inverted pendulum in Fig. 1 (top) is theoretically also at a resting position but can only be observed transiently in practice because the slightest real world perturbation will cause the pendulum to fall due to gravity. Stability explains these observations: the resting position is (asymptotically) stable while the inverted position is unstable and requires active control to ensure its stability. Proofs of safety and liveness properties are still required for the inverted pendulum under control, e.g., its controller must never generate unsafe amounts of torque and the pendulum must eventually reach the inverted position. The triumvirate of safety, liveness, and stability is required for holistic correctness of the inverted pendulum controller.

This research was sponsored by the AFOSR under grant number FA9550-16-1-0288. The first author was supported by A\*STAR, Singapore.

© The Author(s) 2021

J. F. Groote and K. G. Larsen (Eds.): TACAS 2021, LNCS 12652, pp. 181–199, 2021. https://doi.org/10.1007/978-3-030-72013-1_10

The classical way of distinguishing the aforementioned stability situations is by designing a Lyapunov function [19], i.e., an energy-like auxiliary measure satisfying certain arithmetical conditions [14,18,31] which implies that the auxiliary energy decreases along system trajectories towards local minima at the stable resting state(s), see Fig. 2. Prior approaches [1,12,17,21,33] have emphasized the need to formally verify those arithmetical conditions in order to guarantee that a conjectured Lyapunov function correctly implies stability for a given system.

**Fig. 1.** A pendulum (in green) hung by a rigid rod from a pivot (in black) perturbed from its resting state (bottom) and from its inverted, upright position (top). Perturbed states (with dashed boundaries) are faded out to show the progression of time.

**Fig. 2.** A Lyapunov function that decreases along the pendulum trajectory shown in Fig. 1 (bottom).

This paper shows how deductive proofs of ODE stability can be carried out in differential dynamic logic (dL) [25,26,27], a logic for deductive verification of hybrid systems.<sup>1</sup> The key insight is that stability properties can be specified by suitably nesting the dynamic modalities of dL with quantifiers of first-order logic. The resulting specifications are amenable to rigorous proof by combining dL's ODE safety [28] and liveness [36] proof principles with real arithmetic and first-order quantifier reasoning. This makes it possible to syntactically derive stability for a given system from the small set of dL axioms which, in turn, enables trustworthy stability proofs in the KeYmaera X theorem prover for hybrid systems [11,26]. Notably, this approach directly verifies stability specifications, which goes beyond verifying arithmetical conditions that imply those specifications [1,12,17,21,33]. This is crucial for advanced stability notions because those variations generally require subtle twists to the required arithmetical conditions on their Lyapunov functions [14]; proofs of stability specifications alleviate the onus on system designers to correctly pick and check the appropriate conditions for their applications. Section 3 shows how various stability properties for ODE equilibria can be formally specified and proved in dL with Lyapunov function techniques. Section 4 generalizes those stability specifications, yielding unambiguous formal specifications of advanced stability properties from the literature [14,18], along with their derived proof rules. These specifications also provide rigorous insights into the logical relationship between various stability notions, which are used to inform their respective proofs. Section 5 illustrates the practicality of this paper's dL approach through several stability case studies formalized in KeYmaera X.

<sup>1</sup> Hybrid systems are mathematical models describing discrete and continuous dynamics, and interactions thereof. This paper's formal understanding of ODE stability is crucial for subsequent investigation of hybrid systems stability [5,13,20].

All omitted definitions and proofs are available in the supplement [35].

#### **2 Background: Differential Dynamic Logic**

This section briefly recalls the syntax and semantics of dL, focusing on its continuous fragment which has a complete axiomatization for ODE invariants [28]. Full presentations of dL, including its discrete fragment, are elsewhere [26,27].

**Syntax and Semantics.** The grammar of dL terms is as follows, where x ∈ V is a variable and c ∈ ℚ is a rational constant. These terms are polynomials over V (extensions with Noetherian functions [28] such as exp, sin, cos are possible):

$$p, q \; ::= \; x \mid c \mid p+q \mid p \cdot q$$

The grammar of dL formulas is as follows, where ∼ ∈ {=, ≠, ≥, >, ≤, <} is a comparison operator and α is a hybrid program:

$$\phi, \psi \; ::= \; p \sim q \mid \phi \land \psi \mid \phi \lor \psi \mid \neg \phi \mid \forall v \, \phi \mid \exists v \, \phi \mid [\alpha] \phi \mid \langle \alpha \rangle \phi$$

This grammar features atomic comparisons (p ∼ q), propositional connectives (¬, ∧, ∨), first-order quantifiers over the reals (∀, ∃), and the box ([α]φ) and diamond (⟨α⟩φ) modality formulas, which express that all or some runs of hybrid program α satisfy φ, respectively. The modalities [·], ⟨·⟩ can be freely nested with first-order and modal connectives, which is crucial for the specification of stability properties in Sections 3 and 4. Formulas not containing the modalities are formulas of first-order real arithmetic and are written as P, Q, R.

This paper focuses on the continuous fragment of hybrid programs α ≡ x' = f(x) & Q, where x' = f(x) is an n-dimensional system of ordinary differential equations (ODEs), x<sub>1</sub>' = f<sub>1</sub>(x), …, x<sub>n</sub>' = f<sub>n</sub>(x), over variables x = (x<sub>1</sub>, …, x<sub>n</sub>), the LHS x<sub>i</sub>' is the time derivative of x<sub>i</sub> and the RHS f<sub>i</sub>(x) is a polynomial over variables x. The evolution domain constraint Q specifies the set of states in which the ODE is allowed to evolve continuously. When Q is the formula true, the ODE is also written as x' = f(x). For n-dimensional vectors x, y, the dot product is x · y ≝ Σ<sub>i=1</sub><sup>n</sup> x<sub>i</sub>y<sub>i</sub> and ‖x‖<sup>2</sup> ≝ Σ<sub>i=1</sub><sup>n</sup> x<sub>i</sub><sup>2</sup> denotes the squared Euclidean norm. Variables z ∈ V \ {x} not occurring on the LHS of ODE x' = f(x) are parameters that remain constant along ODE solutions. The following parametric ODE model of a simple pendulum is used as a running example.

Example 1 (Pendulum model). The ODE α<sub>p</sub> ≡ θ' = ω, ω' = −(g/L) sin(θ) − bω models a pendulum (illustrated below) suspended from a pivot by a rod of length L, where θ is the angle of displacement, ω is the angular velocity of the pendulum, and g > 0 is the gravitational constant. Parameter a = g/L is a positive scaling constant and parameter b ≥ 0 is the coefficient of friction for angular velocity. The symbolic parameters a, b make analysis of α<sub>p</sub> apply to a range of concrete values, e.g., pendulums that are suspended by a long rod (with large L) are modeled by small positive values of a, while frictionless pendulums have b = 0.

A simplification of α<sub>p</sub> is used because stability analyses often concern the behavior of the pendulum near its resting (or inverted) state where θ = 0. For such nearby states with θ ≈ 0, the small angle approximation sin(θ) ≈ θ yields a linear ODE:<sup>2</sup>

$$
\alpha\_l \equiv \theta' = \omega,\ \omega' = -a\theta - b\omega \tag{1}
$$

An inverted pendulum is modeled by a similar ODE (illustrated on the right) under a change of coordinates. Such a pendulum requires an external torque input u(θ, ω) to maintain its stability; u(θ, ω) is determined and proved correct in Section 5.

$$
\alpha\_i \equiv \theta' = \omega,\ \omega' = a\theta - b\omega - u(\theta,\omega) \tag{2}
$$
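As an informal illustration of the small angle approximation behind (1), the following Python sketch integrates both α<sub>p</sub> and α<sub>l</sub> numerically from the same small displacement and compares the resulting angles. The parameter values a = 1 and b = 0.5, the initial state, the step size, and the helper names `rk4_step` and `simulate` are assumptions made only for this illustration. Numerical simulation is not part of the dL proofs discussed in this paper.

```python
import math

def pendulum_nonlinear(state, a, b):
    """Right-hand side of alpha_p: theta' = omega, omega' = -a*sin(theta) - b*omega."""
    theta, omega = state
    return (omega, -a * math.sin(theta) - b * omega)

def pendulum_linear(state, a, b):
    """Right-hand side of alpha_l from (1): theta' = omega, omega' = -a*theta - b*omega."""
    theta, omega = state
    return (omega, -a * theta - b * omega)

def rk4_step(f, state, h, a, b):
    """One classical Runge-Kutta step of size h for the vector field f."""
    def shifted(k, c):
        return tuple(s + c * ki for s, ki in zip(state, k))
    k1 = f(state, a, b)
    k2 = f(shifted(k1, h / 2), a, b)
    k3 = f(shifted(k2, h / 2), a, b)
    k4 = f(shifted(k3, h), a, b)
    return tuple(s + h / 6 * (p + 2 * q + 2 * r + w)
                 for s, p, q, r, w in zip(state, k1, k2, k3, k4))

def simulate(f, state, a, b, h=0.01, steps=2000):
    """Return the list of states visited from `state` (inclusive)."""
    trajectory = [state]
    for _ in range(steps):
        state = rk4_step(f, state, h, a, b)
        trajectory.append(state)
    return trajectory

# Assumed illustrative values: a = 1, b = 0.5, initial angle 0.1 rad, at rest.
a, b = 1.0, 0.5
start = (0.1, 0.0)
nonlinear = simulate(pendulum_nonlinear, start, a, b)
linear = simulate(pendulum_linear, start, a, b)
max_gap = max(abs(p[0] - q[0]) for p, q in zip(nonlinear, linear))
print(f"largest angle difference over 20 time units: {max_gap:.2e}")
```

For larger initial angles the two models diverge, which is why results obtained for α<sub>l</sub> are local statements about α<sub>p</sub>.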

States ν : V → ℝ assign real values to each variable in V; the set of all states is S. The semantics of dL formula φ is the set of states [[φ]] ⊆ S in which φ is true [26,27], where the semantics of first-order logical connectives are defined as usual, e.g., [[φ ∧ ψ]] = [[φ]] ∩ [[ψ]]. For ODEs, the semantics of the modal operators is as follows.<sup>3</sup> Let ν ∈ S and ϕ : [0, T) → S for some 0 < T ≤ ∞, be the unique, right-maximal solution [6] to ODE x' = f(x) with initial value ϕ(0) = ν:

$$\nu \in [\![ [x' = f(x) \,\&\, Q] \phi ]\!] \text{ iff for all } 0 \le \tau < T \text{ where } \varphi(\zeta) \in [\![ Q ]\!] \text{ for all } 0 \le \zeta \le \tau\text{: } \varphi(\tau) \in [\![ \phi ]\!]$$

$$\nu \in [\![ \langle x' = f(x) \,\&\, Q \rangle \phi ]\!] \text{ iff there exists } 0 \le \tau < T \text{ such that: } \varphi(\tau) \in [\![ \phi ]\!] \text{ and } \varphi(\zeta) \in [\![ Q ]\!] \text{ for all } 0 \le \zeta \le \tau$$
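The two semantic clauses can be mimicked on a finite list of sampled states. The following Python sketch is only an approximation under stated assumptions: it inspects finitely many sample points of one trajectory rather than the exact right-maximal solution, and the names `box_holds` and `diamond_holds` as well as the example predicates are hypothetical, not part of dL or KeYmaera X.

```python
import math

def box_holds(samples, domain, post):
    """Sampled analogue of [x' = f(x) & Q]phi: phi holds at every sample
    reached while all samples so far (inclusive) satisfy Q."""
    for state in samples:
        if not domain(state):
            return True   # later samples are not reachable inside Q
        if not post(state):
            return False
    return True

def diamond_holds(samples, domain, post):
    """Sampled analogue of <x' = f(x) & Q>phi: some sample satisfies phi
    while all samples up to and including it satisfy Q."""
    for state in samples:
        if not domain(state):
            return False  # Q was left before phi was reached
        if post(state):
            return True
    return False

# x(t) = exp(-t) solves x' = -x with x(0) = 1; sample it at t = 0, 0.1, ..., 9.9.
samples = [(math.exp(-0.1 * k),) for k in range(100)]
always_positive = box_holds(samples, lambda s: True, lambda s: s[0] > 0)
reaches_half = diamond_holds(samples, lambda s: True, lambda s: s[0] <= 0.5)
blocked = diamond_holds(samples, lambda s: s[0] >= 0.75, lambda s: s[0] <= 0.5)
```

In this sketch the evolution domain constraint x ≥ 0.75 stops the run before x ≤ 0.5 is reached, so the last diamond check fails even though the unconstrained one succeeds.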

For a formula P the ε-neighborhood of P with respect to x is defined as 𝒰<sub>ε</sub>(P) ≝ ∃y (‖x − y‖<sup>2</sup> < ε<sup>2</sup> ∧ P(y)), where the existentially quantified variables y are fresh in P. The neighborhood formula 𝒰<sub>ε</sub>(P) characterizes the set of states within distance ε from P, with respect to the dynamically evolving variables x. This is useful for syntactically expressing small ε perturbations in the stability definitions of Sections 3 and 4. For formulas P of first-order real arithmetic, the ε-neighborhood, 𝒰<sub>ε</sub>(P), can be equivalently expressed in quantifier-free form by quantifier elimination [4]. For example, 𝒰<sub>ε</sub>(x = 0) is equivalent to the formula ‖x‖<sup>2</sup> < ε<sup>2</sup>. Formulas P̄ and ∂P are the syntactically definable topological closure and boundary of the set characterized by P, respectively [4].

<sup>2</sup> This linearization is justified by the Hartman-Grobman theorem [6]. A nonlinear polynomial approximation, such as sin(θ) ≈ θ − θ<sup>3</sup>/6, can also be used.

<sup>3</sup> The semantics of dL formulas is defined compositionally elsewhere [26,27].

**Proof Calculus.** All derivations and proof rules are presented in a classical sequent calculus. The semantics of sequent Γ ⊢ φ is equivalent to the formula (⋀<sub>ψ∈Γ</sub> ψ) → φ. A sequent is valid iff its corresponding formula is valid. Completed branches in a sequent proof are marked with ∗. Assumptions ψ ∈ Γ that have only ODE parameters as free variables remain true along ODE evolutions and are soundly kept across ODE deduction steps [26,27]. First-order real arithmetic is decidable [4] so we assume such a decision procedure and label proof steps with ℝ when they follow from real arithmetic. Axioms and proof rules are derivable iff they can be deduced from sound dL axioms and proof rules [26,27].

Formula I is an invariant of the ODE x' = f(x) & Q iff the formula I → [x' = f(x) & Q]I is valid. The dL proof calculus is complete for ODE invariants [28], i.e., any true ODE invariant expressible in first-order real arithmetic can be proved in the calculus. The calculus also supports refinement reasoning [36] for proving ODE liveness properties P → ⟨x' = f(x) & Q⟩R, which says that the goal R is reached along the ODE x' = f(x) & Q from precondition P.

An important syntactic tool for reasoning with ODE x' = f(x) is the Lie derivative of term p defined as ṗ ≝ Σ<sub>x<sub>i</sub>∈x</sub> (∂p/∂x<sub>i</sub>) f<sub>i</sub>(x), whose semantic value is equal to the time derivative of the value of p along solutions ϕ of the ODE [26,28]. Lie derivatives are provably definable in dL using syntactic differentials [26].
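Because terms are polynomials, the Lie derivative can be computed purely symbolically. The following Python sketch shows one way to do so with exact rational coefficients; the dictionary representation of polynomials, the helper names, and the choice of the linearized pendulum (1) with a = b = 1 and p = θ<sup>2</sup> + ω<sup>2</sup> are assumptions made for this illustration and are unrelated to how KeYmaera X represents terms.

```python
from fractions import Fraction

def poly_add(p, q):
    """Sum of two polynomials given as {exponent tuple: coefficient}."""
    result = dict(p)
    for mono, coeff in q.items():
        result[mono] = result.get(mono, 0) + coeff
    return {m: c for m, c in result.items() if c != 0}

def poly_mul(p, q):
    """Product of two polynomials."""
    result = {}
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            mono = tuple(e1 + e2 for e1, e2 in zip(m1, m2))
            result[mono] = result.get(mono, 0) + c1 * c2
    return {m: c for m, c in result.items() if c != 0}

def partial(p, i):
    """Partial derivative of p with respect to the i-th variable."""
    result = {}
    for mono, coeff in p.items():
        if mono[i] > 0:
            lowered = mono[:i] + (mono[i] - 1,) + mono[i + 1:]
            result[lowered] = result.get(lowered, 0) + coeff * mono[i]
    return result

def lie_derivative(p, rhs):
    """Sum over i of (dp/dx_i) * f_i for the ODE x' = f(x) with f = rhs."""
    total = {}
    for i, f_i in enumerate(rhs):
        total = poly_add(total, poly_mul(partial(p, i), f_i))
    return total

# Variables are ordered (theta, omega); alpha_l from (1) with assumed a = b = 1.
theta = {(1, 0): Fraction(1)}
omega = {(0, 1): Fraction(1)}
rhs = [omega, {(1, 0): Fraction(-1), (0, 1): Fraction(-1)}]
p = poly_add(poly_mul(theta, theta), poly_mul(omega, omega))  # theta^2 + omega^2
p_dot = lie_derivative(p, rhs)
```

Here `p_dot` represents −2ω<sup>2</sup>: the mixed terms 2θω and −2θω cancel, so the squared norm is non-increasing along solutions of this instance.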

#### **3 Asymptotic Stability of an Equilibrium Point**

This section presents Lyapunov's classical notion of asymptotic stability [19] and its formal specification in dL. This formalization enables the derivation of dL stability proof rules with Lyapunov functions [14,18,19,31]. Several related stability concepts are formalized in dL, along with their relationships and rules.

#### **3.1 Mathematical Preliminaries**

An equilibrium point of ODE x' = f(x) is a point x<sub>0</sub> ∈ ℝ<sup>n</sup> where f(x<sub>0</sub>) = 0, so a system that starts at x<sub>0</sub> stays at x<sub>0</sub> along its continuous evolution. Such points are often interesting in real-world systems, e.g., the equilibrium point θ = 0, ω = 0 for α<sub>l</sub> from (1) is the resting state of a pendulum. For a controlled system, equilibrium points often correspond to desired steady system states where no further continuous control input (modeled as part of f(x)) is required [18].

For brevity, assume the origin 0 ∈ ℝ<sup>n</sup> is an equilibrium point of interest. Any other equilibrium point(s) of interest x<sub>0</sub> ∈ ℝ<sup>n</sup> can be translated to the origin with the change of coordinates x ↦ x − x<sub>0</sub> for the ODE (see supplement [35]).

**Fig. 3.** Solutions from points in the δ ball around the origin, like the green initial point x, remain within the ε ball around the origin 0 ∈ ℝ<sup>n</sup> (black dot) and asymptotically approach the origin. The latter two plots illustrate how asymptotic stability for an ODE can be broken down into a pair of (quantified) ODE safety and liveness properties.

The following definition of asymptotic stability is standard [14,18,31].<sup>4</sup>

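**Definition 2 (Asymptotic stability [14,18,31]).** The origin 0 ∈ ℝ<sup>n</sup> of ODE x' = f(x) is i) *stable* if for all ε > 0 there exists δ > 0 such that for all initial states x = x(0) with ‖x‖ < δ, the right-maximal ODE solution x(t) : [0, T) → ℝ<sup>n</sup> satisfies ‖x(t)‖ < ε for all times 0 ≤ t < T, ii) *attractive* if there exists δ > 0 such that for all initial states x = x(0) with ‖x‖ < δ, the right-maximal ODE solution x(t) : [0, T) → ℝ<sup>n</sup> satisfies lim<sub>t→T</sub> x(t) = 0, and iii) *asymptotically stable* if it is stable and attractive.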


These definitions can be understood using the resting state of the pendulum from Fig. 1 (bottom) which is asymptotically stable. When the pendulum is given a light push from its bottom resting state (formally, ‖x‖ < δ), it gently oscillates near that resting state (formally, ‖x(t)‖ < ε). In the presence of friction, these oscillations eventually dissipate so the pendulum asymptotically returns to its resting state (formally, lim<sub>t→T</sub> x(t) = 0). This behavior is local, i.e., for any given ε > 0, there exists a sufficiently small δ > 0 perturbation of the initial state that results in gentle oscillations with ‖x(t)‖ < ε, see Fig. 3 (left). A strong push, e.g., with δ > ε, could instead cause the pendulum to spin around on its pivot.

Remark 3. Stability and attractivity do not imply each other [31, Chapter I.2.7]. However, if the origin is stable, attractivity can be defined in a simpler way. This is proved in dL, after characterizing stability and attractivity syntactically.

#### **3.2 Formal Specification**

The formal specification of asymptotic stability in dL combines i) the dynamic modalities of dL, which are used to quantify over the dynamics of the ODE, and ii) the first-order logic quantifiers, which are used to express combinations of (topologically) local and asymptotic properties of those dynamics.

<sup>4</sup> Some definitions require, or implicitly assume, right-maximal solutions x(t) to be global, i.e., with T = ∞, see [18, Definition 4.1] and associated discussion. The definitions presented here are better suited for subsequent generalizations.

**Lemma 4 (Asymptotic stability in dL).** The origin of ODE x' = f(x) is i) *stable*, ii) *attractive*, and iii) *asymptotically stable* iff the dL formulas i) Stab(x' = f(x)), ii) Attr(x' = f(x)), and iii) AStab(x' = f(x)), respectively, are valid. Variables ε, δ are fresh, i.e., not in x, f(x).

$$\begin{aligned} \text{Stab}(x'=f(x)) & \equiv \forall \varepsilon > 0 \, \exists \delta > 0 \, \forall x \left( \mathcal{U}\_{\delta}(x=0) \to [x'=f(x)] \, \mathcal{U}\_{\varepsilon}(x=0) \right) \\ \text{Attr}(x'=f(x)) & \equiv \exists \delta > 0 \, \forall x \left( \mathcal{U}\_{\delta}(x=0) \to \text{Asym}(x'=f(x), x=0) \right) \\ \text{AStab}(x'=f(x)) & \equiv \text{Stab}(x'=f(x)) \land \text{Attr}(x'=f(x)) \end{aligned}$$

Formula Asym(x' = f(x), P) ≡ ∀ε>0 ⟨x' = f(x)⟩[x' = f(x)] 𝒰<sub>ε</sub>(P) characterizes the set of states that asymptotically approach P along ODE solutions.

Formula Stab(x' = f(x)) is a syntactic dL rendering of the corresponding quantifiers from Def. 2. The safety property 𝒰<sub>δ</sub>(x = 0) → [x' = f(x)] 𝒰<sub>ε</sub>(x = 0) expresses that solutions starting from the δ-neighborhood of the origin always (for all times) stay safely in the ε-neighborhood, as visualized in Fig. 3 (middle).

Formula Attr(x' = f(x)) uses the subformula Asym(x' = f(x), x = 0) which characterizes the limit in Def. 2. Recall lim<sub>t→T</sub> x(t) = 0 iff for all ε > 0 there exists a time τ with 0 ≤ τ < T such that for all times t with τ ≤ t < T, the solution satisfies ‖x(t)‖ < ε, i.e., the limit requires that for all distances ε > 0, the ODE solution will eventually always be within distance ε of the origin, as visualized in Fig. 3 (right). This limit is characterized using nested ⟨·⟩[·] modalities, together with first-order quantification according to Def. 2. More generally, formula Asym(x' = f(x), P) characterizes the set of initial states where the right-maximal ODE solution asymptotically approaches P; this set is known as the region of attraction of P [18]. Thus, attractivity requires that the region of attraction of the origin contains an open neighborhood 𝒰<sub>δ</sub>(x = 0) of the origin.
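The safety and limit properties inside Stab and Attr can be tested, though not proved, by simulation. The following Python sketch samples initial states just inside a δ ball for the linearized pendulum (1) and checks that the simulated trajectories stay inside the ε ball and end close to the origin. The parameter values a = b = 1, the candidate pair ε = 1, δ = 0.5, the finite time horizon, and all function names are assumptions made for this illustration. A passing test is evidence only, whereas the dL formulas quantify over all initial states and all times.

```python
import math

def alpha_l(state, a=1.0, b=1.0):
    """Vector field of the linearized pendulum (1) with assumed a = b = 1."""
    theta, omega = state
    return (omega, -a * theta - b * omega)

def rk4_step(f, state, h):
    """One classical Runge-Kutta step of size h."""
    def shifted(k, c):
        return tuple(s + c * ki for s, ki in zip(state, k))
    k1 = f(state)
    k2 = f(shifted(k1, h / 2))
    k3 = f(shifted(k2, h / 2))
    k4 = f(shifted(k3, h))
    return tuple(s + h / 6 * (p + 2 * q + 2 * r + w)
                 for s, p, q, r, w in zip(state, k1, k2, k3, k4))

def sq_norm(state):
    """Squared Euclidean norm of a state."""
    return sum(c * c for c in state)

def check_sample(eps, delta, directions=16, h=0.01, steps=3000):
    """Test the eps/delta safety property and the final distance to the origin
    from sampled initial states just inside the delta ball."""
    stays_within = True
    final_sq_norms = []
    for k in range(directions):
        angle = 2 * math.pi * k / directions
        state = (0.99 * delta * math.cos(angle), 0.99 * delta * math.sin(angle))
        for _ in range(steps):
            state = rk4_step(alpha_l, state, h)
            stays_within = stays_within and sq_norm(state) < eps * eps
        final_sq_norms.append(sq_norm(state))
    return stays_within, max(final_sq_norms)

stays_within, largest_final = check_sample(eps=1.0, delta=0.5)
```

A failed check would refute the candidate δ for the given ε on the sampled run, which can be a cheap sanity test before attempting a formal proof.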

From Lemma 4, proving validity of the formula AStab(x' = f(x)) yields a rigorous proof of asymptotic stability for x' = f(x). However, if the origin is stable, then attractivity can be provably simplified with the following corollary.

**Corollary 5 (Stable attractivity).** The following axiom is derivable in dL.

$$\text{SAttr} \quad \text{Stab}(x'=f(x)) \to \left( \text{Asym}(x'=f(x), x=0) \leftrightarrow \forall \varepsilon > 0 \, \langle x'=f(x) \rangle \, \mathcal{U}\_{\varepsilon}(x=0) \right)$$

Corollary 5 simplifies the syntactic characterization of the region of attraction for stable equilibria from a nested ⟨·⟩[·] formula to a ⟨·⟩ formula, which is then directly amenable to ODE liveness reasoning [36]. This corollary is used to simplify proofs of asymptotic stability, as explained next.

#### **3.3 Lyapunov Functions**

Lyapunov functions are the standard tool for showing stability of general, nonlinear ODEs [14,18,31] and finding suitable Lyapunov functions is an important problem in its own right [1,9,12,17,21,23,24,33,37]. This section shows how a candidate Lyapunov function, once found, can be used to rigorously prove stability. The following proof rules derive Lyapunov stability arguments [14,18,31] syntactically in dL.

**Lemma 6 (Lyapunov functions).** The following Lyapunov function proof rules are derivable in dL.

$$\text{Lyap}\_{\geq} \ \frac{\vdash f(0) = 0 \land v(0) = 0 \qquad \vdash \exists \gamma > 0 \, \forall x \, \left( 0 < \|x\|^2 \le \gamma^2 \to v > 0 \land \dot{v} \le 0 \right)}{\vdash \text{Stab}(x' = f(x))}$$

$$\text{Lyap}\_{>} \ \frac{\vdash f(0) = 0 \land v(0) = 0 \qquad \vdash \exists \gamma > 0 \, \forall x \, \left( 0 < \|x\|^2 \le \gamma^2 \to v > 0 \land \dot{v} < 0 \right)}{\vdash \text{AStab}(x' = f(x))}$$

Rules Lyap<sub>≥</sub>, Lyap<sub>&gt;</sub> use the Lyapunov function v as an auxiliary, energy-like function near the origin which is positive and has non-positive (resp. negative for Lyap<sub>&gt;</sub>) derivative v̇. This guarantees that v is non-increasing (resp. decreasing) along ODE solutions near the origin, see Fig. 2. The right premise of both rules uses ∃γ>0 ∀x (0 < ‖x‖<sup>2</sup> ≤ γ<sup>2</sup> → ···) to require that the Lyapunov function conditions are true in a γ-neighborhood of the origin. The subtle difference in sign condition for v̇ between rules Lyap<sub>≥</sub>, Lyap<sub>&gt;</sub> is illustrated for the pendulum.

Example 7 (Pendulum asymptotic stability). For ODE α<sub>l</sub> from (1), a suitable Lyapunov function for proving its stability [18] is v = aθ<sup>2</sup>/2 + ((bθ+ω)<sup>2</sup> + ω<sup>2</sup>)/4, where the Lie derivative of v along α<sub>l</sub> is v̇ = −(b/2)(aθ<sup>2</sup> + ω<sup>2</sup>). Stability<sup>5</sup> is formally proved in dL for any parameter values a > 0, b ≥ 0 using rule Lyap<sub>≥</sub> because both of its resulting arithmetical premises are provable by ℝ. The full dL derivation, also used in KeYmaera X (Section 5), is shown in the proof of Lemma 6 [35].

When b > 0, i.e., friction is non-negligible, an identical derivation with Lyap<sub>&gt;</sub> instead of Lyap<sub>≥</sub> proves asymptotic stability because −(b/2)(aθ<sup>2</sup> + ω<sup>2</sup>) is negative except at the origin. Indeed, displacements to the pendulum's resting state can only be dissipated in the presence of friction, not when b = 0.
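The Lie derivative stated in Example 7 can be checked by hand. The following worked computation is a sketch that uses only the ODE (1) and the definition of the Lie derivative from Section 2, with ∂v/∂θ = aθ + (b/2)(bθ + ω) and ∂v/∂ω = (b/2)θ + ω:

$$\begin{aligned} \dot{v} &= \frac{\partial v}{\partial \theta}\,\omega + \frac{\partial v}{\partial \omega}\,(-a\theta - b\omega) \\ &= \left(a\theta + \tfrac{b^2}{2}\theta + \tfrac{b}{2}\omega\right)\omega + \left(\tfrac{b}{2}\theta + \omega\right)(-a\theta - b\omega) \\ &= a\theta\omega + \tfrac{b^2}{2}\theta\omega + \tfrac{b}{2}\omega^2 - \tfrac{ab}{2}\theta^2 - \tfrac{b^2}{2}\theta\omega - a\theta\omega - b\omega^2 \\ &= -\tfrac{b}{2}\left(a\theta^2 + \omega^2\right) \end{aligned}$$

The mixed θω terms cancel exactly, which is what makes this particular choice of v convenient for the linearized pendulum.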

#### **3.4 Asymptotic Stability Variations**

Asymptotic stability is a strong guarantee about the local behavior of ODE solutions near equilibrium points of interest. In certain applications, stronger stability guarantees may be needed for those equilibria [18]. This section examines two standard stability variations, shows how they can be proved in dL, and formally analyzes their logical relationship with asymptotic stability.

**Exponential stability** As the name suggests, the first stability variation, exponential stability, guarantees an exponential rate of convergence towards the equilibrium point from an initial displacement. This is useful, e.g., for bounding the time spent by a perturbed system far away from its desired operating state.

**Definition 8 (Exponential stability [14,18,31]).** The origin 0 ∈ ℝ<sup>n</sup> of ODE x' = f(x) is *exponentially stable* if there are positive constants α, β, δ > 0 such that for all initial states x = x(0) with ‖x‖ < δ, the right-maximal ODE solution x(t) : [0, T) → ℝ<sup>n</sup> satisfies ‖x(t)‖ ≤ α‖x(0)‖ exp(−βt) for all times 0 ≤ t < T.

<sup>5</sup> For the trigonometric pendulum ODE α<sub>p</sub> from Example 1, the Lyapunov function v = a(1 − cos(θ)) + ((bθ+ω)<sup>2</sup> + ω<sup>2</sup>)/4 with Lie derivative v̇ = −(b/2)(aθ sin(θ) + ω<sup>2</sup>) proves its stability [18] but requires arithmetic reasoning over trigonometric functions.

Exponential stability bounds the norm of solutions to ODE x' = f(x) near the origin by a decaying exponential. It is specified in dL as follows.

**Lemma 9 (Exponential stability in dL).** The origin of ODE x' = f(x) is *exponentially stable* iff the following dL formula is valid. Variables α, β, δ, y are fresh, i.e., not in x, f(x).

$$\begin{aligned} \text{EStab}(x'=f(x)) & \equiv \exists \alpha > 0 \, \exists \beta > 0 \, \exists \delta > 0 \, \forall x \, \big( \mathcal{U}\_{\delta}(x=0) \to \\ & \quad \left[ y := \alpha^2 \|x\|^2; x' = f(x), y' = -2\beta y \right] \|x\|^2 \le y \big) \end{aligned}$$

The discrete assignment y := α<sup>2</sup>‖x‖<sup>2</sup> sets the value of variable y to that of α<sup>2</sup>‖x‖<sup>2</sup> and ; denotes sequential composition of hybrid programs [26,27].

Formula EStab(x' = f(x)) uses a fresh variable y with ODE y' = −2βy and initialized to α<sup>2</sup>‖x‖<sup>2</sup> so that y differentially axiomatizes [28] the (squared) decaying exponential function α<sup>2</sup>‖x(0)‖<sup>2</sup> exp(−2βt) along ODE solutions. Such an implicit (polynomial) characterization of exponential decay allows syntactic proof steps to use decidable real arithmetic reasoning.
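To see why the postcondition compares squared quantities, it may help to spell out the step from Def. 8. The following is a short worked sketch: the linear ODE for y has the closed-form solution shown on the left, and because both sides of the bound in Def. 8 are non-negative, squaring it gives an equivalent polynomial inequality in x and y.

$$y(t) = \alpha^2 \|x(0)\|^2 \exp(-2\beta t), \qquad \|x(t)\| \le \alpha \|x(0)\| \exp(-\beta t) \iff \|x(t)\|^2 \le y(t)$$

The exponential function therefore never has to appear in the formula itself; it is represented only through the evolution of y.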

**Lemma 10 (Lyapunov function for exponential stability).** The following Lyapunov function proof rule for exponential stability is derivable in dL, where k<sub>1</sub>, k<sub>2</sub>, k<sub>3</sub> ∈ ℚ are positive constants.

$$\text{Lyap}\_{\text{E}} \ \frac{\vdash \exists \gamma > 0 \, \forall x \left( \|x\|^2 \le \gamma^2 \to k\_1^2 \|x\|^2 \le v \le k\_2^2 \|x\|^2 \land \dot{v} \le -2k\_3 v \right)}{\vdash \text{EStab}(x' = f(x))}$$

Rule Lyap<sub>E</sub> enables proofs of exponential stability in dL. In fact, the proof of Lemma 10 (see supplement [35]) yields concrete, quantitative bounds, where EStab(x' = f(x)) is explicitly witnessed with scaling constant α = k<sub>2</sub>/k<sub>1</sub> and decay rate β = k<sub>3</sub>. These can be used to calculate time bounds when the system state will return sufficiently close to the origin. Similarly, the disturbance δ in EStab(x' = f(x)) is quantitatively witnessed by (k<sub>1</sub>/k<sub>2</sub>)γ for any γ witnessing validity of the premise of rule Lyap<sub>E</sub>. This yields a provable estimate of the region around the origin where exponential stability holds; this latter estimate is explored next.
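The witnesses can be motivated by an informal chain of inequalities. The sketch below is not the formal dL derivation, which is given in the supplement [35]. It assumes that the Lyapunov function satisfies a two-sided quadratic bound with constants k<sub>1</sub>, k<sub>2</sub> and the decay condition v̇ ≤ −2k<sub>3</sub>v on the γ ball (consistent with the inequalities used in Example 13), and that the solution stays inside that ball:

$$k\_1^2 \|x(t)\|^2 \le v(x(t)) \le v(x(0)) \exp(-2k\_3 t) \le k\_2^2 \|x(0)\|^2 \exp(-2k\_3 t)$$

Dividing by k<sub>1</sub><sup>2</sup> and taking square roots gives

$$\|x(t)\| \le \frac{k\_2}{k\_1} \|x(0)\| \exp(-k\_3 t)$$

which matches Def. 8 with α = k<sub>2</sub>/k<sub>1</sub> and β = k<sub>3</sub>. If additionally ‖x(0)‖ < (k<sub>1</sub>/k<sub>2</sub>)γ, the same bound gives ‖x(t)‖ < γ, so the solution indeed remains in the region where the assumed conditions hold.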

**Region of attraction** Formulas Attr(x' = f(x)) and EStab(x' = f(x)) both feature a subformula of the form ∃δ > 0 ∀x (𝒰<sub>δ</sub>(x = 0) → ···) which expresses that attractivity (or exponential stability) is locally true in some δ neighborhood of the origin. In many applications, it is useful to find and rigorously prove that a given set is attractive or exponentially stable with respect to the origin [18, Chapter 8.2]. The second stability variation yields provable subsets of the region of attraction, including the special case where it is the entire state space. This is formalized using the following variants of Attr(x' = f(x)) and EStab(x' = f(x)) within a region given by a formula P.

$$\begin{aligned} \text{Attr}^{\text{P}}(x'=f(x),P) & \equiv \forall x \left( P \to \text{Asym}(x'=f(x),x=0) \right) \\ \text{EStab}^{\text{P}}(x'=f(x),P) & \equiv \exists \alpha > 0 \, \exists \beta > 0 \, \forall x \, \big( P \to \\ & \quad \left[ y := \alpha^2 \|x\|^2; x'=f(x), y'=-2\beta y \right] \|x\|^2 \le y \big) \end{aligned}$$

The formula Attr<sup>P</sup>(x' = f(x), P) is valid iff the set characterized by P is a subset of the origin's region of attraction [18]. For example, Attr(x' = f(x)) is ∃δ > 0 Attr<sup>P</sup>(x' = f(x), 𝒰<sub>δ</sub>(x = 0)). This generalization is useful for formalizing stronger notions of stability in dL, such as the following global stability notions [14,18]. For brevity, dL specifications of the stability properties (in **bold**) are given below with mathematical definitions deferred to the supplement [35].

**Lemma 11 (Global stability in dL).** The origin of ODE x' = f(x) is *globally asymptotically stable* iff the dL formula Stab(x' = f(x)) ∧ Attr<sup>P</sup>(x' = f(x), true) is valid. The origin is *globally exponentially stable* iff the dL formula EStab<sup>P</sup>(x' = f(x), true) is valid.

Global stability ensures that all perturbations to the system state are eventually dissipated. The corresponding proof rules are similar to Lyap<sub>&gt;</sub> and Lyap<sub>E</sub>, respectively.

**Lemma 12 (Lyapunov function for global stability).** The following Lyapunov function proof rules for global asymptotic and exponential stability are derivable in dL. In rule Lyap<sup>G</sup><sub>E</sub>, k<sub>1</sub>, k<sub>2</sub>, k<sub>3</sub> ∈ ℚ are positive constants.

$$\text{Lyap}\_{>}^{\text{G}} \ \frac{\vdash f(0) = 0 \land v(0) = 0 \qquad x \neq 0 \vdash v > 0 \land \dot{v} < 0 \qquad \vdash \forall b \, \exists \gamma > 0 \, \forall x \, \left( v \le b \to \mathcal{U}\_{\gamma}(x = 0) \right)}{\vdash \text{Stab}(x' = f(x)) \land \text{Attr}^{\text{P}}(x' = f(x), true)}$$

$$\text{Lyap}\_{\text{E}}^{\text{G}} \ \frac{\vdash k\_1^2 \|x\|^2 \le v \le k\_2^2 \|x\|^2 \land \dot{v} \le -2k\_3 v}{\vdash \text{EStab}^{\text{P}}(x' = f(x), true)}$$

Example 13 (Pendulum global exponential stability). For simplicity, instantiate Example 7 with parameters a = 1, b = 1. The Lyapunov function then simplifies to $v = \frac{\theta^2}{2} + \frac{(\theta+\omega)^2+\omega^2}{4}$ with Lie derivative $\dot{v} = -\frac{\theta^2+\omega^2}{2}$, which satisfies the real arithmetic inequalities $\frac{\theta^2+\omega^2}{4} \le v \le \theta^2 + \omega^2$ and $\dot{v} \le -\frac{1}{2} v$. Thus, rule Lyap<sup>G</sup><sub>E</sub> proves global exponential stability of $\alpha\_l$ with $k\_1 = \frac{1}{2}$, $k\_2 = 1$, and $k\_3 = \frac{1}{4}$. An important caveat is that Example 7 used a local small angle approximation, so this global phenomenon does not hold for a real world pendulum (nor for $\alpha\_p$).
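The arithmetic in Example 13 can be spot-checked numerically. The following is a minimal sketch, not a proof: it assumes that Example 7's linearized pendulum with a = b = 1 has the dynamics θ′ = ω, ω′ = −θ − ω (Example 7 lies outside this section, so this form is an assumption), and the helper names `v`, `v_dot`, and `check` are hypothetical. It samples random states and tests the three inequalities with $k\_1 = \frac{1}{2}$, $k\_2 = 1$, $k\_3 = \frac{1}{4}$.

```python
# Numerical sanity check (not a proof) of the inequalities quoted in Example 13.
# ASSUMPTION: the linearized pendulum of Example 7 with a = b = 1 is
#   theta' = omega,  omega' = -theta - omega.
import random


def v(theta, omega):
    """Lyapunov function candidate from Example 13."""
    return theta**2 / 2 + ((theta + omega)**2 + omega**2) / 4


def v_dot(theta, omega):
    """Lie derivative of v along the assumed ODE, via the chain rule."""
    dtheta, domega = omega, -theta - omega
    dv_dtheta = theta + (theta + omega) / 2
    dv_domega = (theta + omega) / 2 + omega / 2
    return dv_dtheta * dtheta + dv_domega * domega


def check(samples=10_000, tol=1e-9, seed=0):
    """Test the bounds k1^2|x|^2 <= v <= k2^2|x|^2 and v' <= -2*k3*v on samples."""
    rng = random.Random(seed)
    k1, k2, k3 = 0.5, 1.0, 0.25
    for _ in range(samples):
        th, om = rng.uniform(-10, 10), rng.uniform(-10, 10)
        n2 = th**2 + om**2  # squared Euclidean norm of the state
        assert k1**2 * n2 <= v(th, om) + tol
        assert v(th, om) <= k2**2 * n2 + tol
        assert abs(v_dot(th, om) + n2 / 2) <= tol * (1 + n2)
        assert v_dot(th, om) <= -2 * k3 * v(th, om) + tol
    return True
```

Random sampling only guards against transcription errors; the inequalities themselves are polynomial and are the kind of real arithmetic fact that the paper discharges symbolically.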

**Logical relationships** With the proliferation of stability variations just introduced, it is useful to take stock of their logical relationships. An important example of such a relationship is shown in the following corollary.

**Corollary 14 (Exponential stability implies asymptotic stability).** The following axioms are derivable in dL.

$$\begin{array}{ll} \text{EStabStab} & \text{EStab}(x'=f(x)) \to \text{Stab}(x'=f(x)) \\ \text{EStabAttr} & \text{EStab}^{\text{P}}(x'=f(x),P) \to \text{Attr}^{\text{P}}(x'=f(x),P) \end{array}$$

Derived axioms EStabStab, EStabAttr show that exponential stability implies asymptotic stability. In proofs, EStabAttr allows the region of attraction to be estimated using the region where solutions are exponentially bounded.
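The intuition behind EStabAttr can be read off the auxiliary variable y in the definition of EStab<sup>P</sup>. The following sketch uses x(0) for the initial state and t for elapsed time, notation introduced here only for illustration. Along the extended ODE, y has a closed-form solution, so the postcondition $\|x\|^2 \le y$ is the usual exponential estimate, valid for as long as the solution exists:

$$\begin{aligned} y(0) = \alpha^2 \|x(0)\|^2, \quad y' = -2\beta y \quad & \Longrightarrow \quad y(t) = \alpha^2 \|x(0)\|^2 e^{-2\beta t}, \\ \|x(t)\|^2 \le y(t) \quad & \Longrightarrow \quad \|x(t)\| \le \alpha \|x(0)\| e^{-\beta t}. \end{aligned}$$

Because the right-hand side tends to zero as t grows, every initial state satisfying P is attracted to the origin, which is what Attr<sup>P</sup> asserts.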

#### **4 General Stability**

This section provides stability definitions and proof rules that generalize stability for an equilibrium point from Section 3 to the stability of sets. These definitions are useful when the desired stable system state(s) is not modeled by a single equilibrium point, but may instead, e.g., lie on a periodic trajectory [18], a hyperplane, or a continuum of equilibrium points within the state space [14]. The generalized definition is used to formalize two stability notions from the literature [14,18], and to justify their Lyapunov function proof rules.

#### **4.1 General Stability and General Attractivity**

The following general stability formula defines stability in dL with respect to an ODE $x'=f(x)$ and formulas P, R. The quantified variables ε, δ are assumed to be fresh by bound renaming, i.e., they do not appear in x, f(x), P, or R.

$$\operatorname{Stab}\_{\mathcal{R}}^{\mathcal{P}}(x'=f(x),P,R) \equiv \forall \varepsilon > 0 \, \exists \delta > 0 \, \forall x \left( \mathcal{U}\_{\delta}(P) \to [x'=f(x)]\mathcal{U}\_{\varepsilon}(R) \right)$$

This formula generalizes stability of the origin $\text{Stab}(x'=f(x))$ by adding two logical tuning knobs that can be intuitively understood as follows. The precondition P characterizes the initial states from which the system state is expected to be disturbed by some disturbance δ. The postcondition R characterizes the set of desired operating states to which the system must remain close (within the ε neighborhood of R) after being disturbed from its initial states.

The general attractivity formula similarly generalizes $\text{Attr}^{\text{P}}(x'=f(x),P)$ with a postcondition R towards which the ODE solutions from initial states satisfying precondition P are asymptotically attracted.
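As a consistency check, setting both knobs to the origin recovers the familiar ε-δ shape. The sketch below assumes that the neighborhood $\mathcal{U}\_{\delta}(x=0)$ abbreviates $\|x\|^2 < \delta^2$, which matches the ε-stability sequent used later for the jet engine case study, and that Section 3 defines stability of the origin by the same ε-δ pattern; both points refer to material outside this section.

$$\text{Stab}\_{\text{R}}^{\text{P}}(x'=f(x),\, x=0,\, x=0) \equiv \forall \varepsilon > 0 \, \exists \delta > 0 \, \forall x \left( \|x\|^2 < \delta^2 \to [x'=f(x)] \, \|x\|^2 < \varepsilon^2 \right)$$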

$$\operatorname{Attr}\_{\mathcal{R}}^{\mathbb{P}}(x'=f(x),P,R) \equiv \forall x \left(P \to \operatorname{Asym}(x'=f(x),R)\right).$$

**Lemma 15 (General Lyapunov functions).** The following Lyapunov function proof rule for general stability with two stacked premises is derivable in dL.

$$\text{GLyap} \quad \begin{array}{c} \vdash P \to R \\ \vdash \forall \varepsilon > 0 \, \exists 0 < \gamma \leq \varepsilon \, \exists k \left( \begin{array}{l} \forall x \left( \partial (\mathcal{U}\_{\gamma}(R)) \to v \geq k \right) \wedge \\\exists 0 < \delta \leq \gamma \, \forall x \left( \mathcal{U}\_{\delta}(P) \to R \lor v < k \right) \wedge \\\forall x \left( R \lor v < k \to [x' = f(x) \& \overline{\mathcal{U}\_{\gamma}(R)}](R \lor v < k) \right) \end{array} \right) \\\hline \vdash \textit{Stab}\_{\mathcal{R}}^{\mathcal{P}}(x' = f(x), P, R) \end{array}$$

Rule GLyap proves general stability for precondition P and postcondition R. It generalizes the Lyapunov function reasoning underlying rule Lyap<sub>≥</sub> to support arbitrary pre- and postconditions. The conjunct $\forall x \left( \partial(\mathcal{U}\_{\gamma}(R)) \to v \geq k \right)$ requires $v \geq k$ on the boundary of $\mathcal{U}\_{\gamma}(R)$ while the middle conjunct requires $v < k$ for some small neighborhood of P excluding R. The conjunct $\forall x \left( R \lor v < k \to \cdots \right)$ asserts that $R \lor v < k$ is an invariant of the ODE within the closed domain $\overline{\mathcal{U}\_{\gamma}(R)}$. When R is a formula of first-order real arithmetic, this invariance question is provably equivalent in dL to a formula of real arithmetic [28], so the premise of rule GLyap is, in theory, decidable by $\mathbb{R}$ for a given candidate Lyapunov function v. In practice, it is prudent to consider specialized stability notions, for which the premise of rule GLyap can be arithmetically simplified. Proof rules for generalized attractivity are also derivable for specialized instances.

#### **4.2 Specialization**

General stability specializes to several stability notions in the literature. For brevity, dL specifications of the stability properties (in **bold**) are given below with mathematical definitions deferred to the supplement [35].

**Set Stability** An important special case is when the desired operating states are exactly the states from which disturbances are expected, i.e., R ≡ P. This leads to the notion of **set stability** of the set characterized by P [14,18].

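**Lemma 16 (Set Stability in dL).** For the ODE $x'=f(x)$, the set characterized by formula P is i) *stable*, ii) *attractive*, iii) *asymptotically stable*, and iv) *globally asymptotically stable* iff the following dL formulas are valid:

$$\begin{array}{ll} \text{i)} & \text{Stab}\_{\text{R}}^{\text{P}}(x'=f(x), P, P) \\ \text{ii)} & \exists \delta > 0 \, \text{Attr}\_{\text{R}}^{\text{P}}(x'=f(x), \mathcal{U}\_{\delta}(P), P) \\ \text{iii)} & \text{Stab}\_{\text{R}}^{\text{P}}(x'=f(x), P, P) \land \exists \delta > 0 \, \text{Attr}\_{\text{R}}^{\text{P}}(x'=f(x), \mathcal{U}\_{\delta}(P), P) \\ \text{iv)} & \text{Stab}\_{\text{R}}^{\text{P}}(x'=f(x), P, P) \land \text{Attr}\_{\text{R}}^{\text{P}}(x'=f(x), true, P) \end{array}$$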

The intuition for Lemma 16 is similar to Lemmas 4 and 11, except formula P (instead of the origin) characterizes the set of desirable states. An application of set stability is shown in the following example.

Example 17 (Tennis racket theorem [3]). The following system of ODEs models the rotation of a 3D rigid body [6,14], where $x\_1, x\_2, x\_3$ are angular velocities and $I\_1 > I\_2 > I\_3 > 0$ are the principal moments of inertia along the respective axes.

$$\alpha\_r \equiv x\_1' = \frac{I\_2 - I\_3}{I\_1} x\_2 x\_3, \quad x\_2' = \frac{I\_3 - I\_1}{I\_2} x\_3 x\_1, \quad x\_3' = \frac{I\_1 - I\_2}{I\_3} x\_1 x\_2$$

When such a rigid object is spun or rotated on each of its axes, a well-known physical curiosity [3] is that the rotation is stable in the first and third axes, whilst additional (unstable) twisting motion is observed for the intermediate axis. Mathematically, a perfect rotation, e.g., around $x\_1$, corresponds to a (large) initial value for $x\_1$ with no rotation in the other axes, i.e., $x\_2 = 0, x\_3 = 0$. Accordingly, the real world observation of stability for rotations about the first principal axis is explained by stability with respect to small perturbations in $x\_2, x\_3$, as formally specified by formula (3) below. Note that the set characterized by formula $x\_2 = 0 \land x\_3 = 0$ is the entire $x\_1$ axis, not just a single point. Similarly, rotations are stable around the third principal axis iff formula (4) is valid.

$$\text{Stab}\_{\text{R}}^{\text{P}}(\alpha\_r, x\_2 = 0 \land x\_3 = 0, x\_2 = 0 \land x\_3 = 0) \tag{3}$$

$$\text{Stab}\_{\text{R}}^{\text{P}}(\alpha\_r, x\_1 = 0 \land x\_2 = 0, x\_1 = 0 \land x\_2 = 0) \tag{4}$$

The validity of formulas (3) and (4) is proved in Example 20.

The formal specification of set stability yields three provable logical consequences which are important stepping stones for the set stability proof rules.

**Corollary 18 (Set stability properties).** The following axioms are derivable in dL. In axiom SClosure, formula $\overline{P}$ characterizes the topological closure of formula P. In axiom SClosed, formula P characterizes a closed set.

$$\begin{array}{ll} \text{SetSAttr} & \text{Stab}\_{\text{R}}^{\text{P}}(x'=f(x), P, P) \to \left( \text{Asym}(x'=f(x), P) \leftrightarrow \forall \varepsilon > 0 \, \langle x'=f(x) \rangle \, \mathcal{U}\_{\varepsilon}(P) \right) \\ \text{SClosure} & \text{Stab}\_{\text{R}}^{\text{P}}(x'=f(x), P, P) \leftrightarrow \text{Stab}\_{\text{R}}^{\text{P}}(x'=f(x), \overline{P}, \overline{P}) \\ \text{SClosed} & \text{Stab}\_{\text{R}}^{\text{P}}(x'=f(x), P, P) \to \forall x \left( P \to [x'=f(x)]P \right) \end{array}$$

Axiom SetSAttr generalizes SAttr and provides a syntactic simplification of the region of attraction for formula P when P is stable. Axiom SClosure says that stability of P is equivalent to stability of its closure $\overline{P}$, because for any perturbation δ > 0, the neighborhoods $\mathcal{U}\_{\delta}(P)$ and $\mathcal{U}\_{\delta}(\overline{P})$ are provably equivalent in real arithmetic. Axiom SClosed says that for closed formulas P, invariance of P is a necessary condition for stability of P. Without loss of generality, it suffices to develop proof rules for stability of formulas characterizing closed (using SClosure) and invariant (using SClosed) sets. Indeed, standard definitions of set stability [14,18] usually assume that the set of concern is closed and invariant.

**Lemma 19 (Set stability Lyapunov functions).** The following Lyapunov function proof rules for set stability are derivable in dL. In derived rules SLyap<sub>≥</sub> and SLyap<sub>&gt;</sub>, formula P characterizes a compact (i.e., closed and bounded) set. In derived rule SLyap<sup>∗</sup><sub>≥</sub>, the two premises are stacked.

$$\begin{array}{ll}
\text{SLyap}\_{\ge} & \dfrac{P \vdash [x'=f(x)]P \quad \neg P \vdash v > 0 \land \dot{v} \le 0 \quad \partial P \vdash v \le 0}{\vdash \text{Stab}^{\text{P}}\_{\text{R}}(x'=f(x), P, P)} \\[2ex]
\text{SLyap}\_{>} & \dfrac{P \vdash [x'=f(x)]P \quad \neg P \vdash v > 0 \land \dot{v} < 0 \quad \partial P \vdash v \le 0}{\vdash \text{Stab}^{\text{P}}\_{\text{R}}(x'=f(x), P, P) \land \exists \delta > 0 \; \text{Attr}^{\text{P}}\_{\text{R}}(x'=f(x), \mathcal{U}\_{\delta}(P), P)} \\[2ex]
\text{SLyap}^{\*}\_{\ge} & \dfrac{\begin{array}{c} P \vdash [x'=f(x)]P \\ \vdash \forall \varepsilon > 0 \; \exists 0 < \gamma \le \varepsilon \; \exists k \left(\begin{array}{l} \forall x \left(\partial(\mathcal{U}\_{\gamma}(P)) \to v \ge k\right) \land \\ \exists 0 < \delta \le \gamma \, \forall x \left(\mathcal{U}\_{\delta}(P) \land \neg P \to v < k\right) \land \\ \cdots \end{array}\right) \end{array}}{\vdash \text{Stab}^{\text{P}}\_{\text{R}}(x'=f(x), P, P)}
\end{array}$$

All three proof rules have the necessary premise $P \vdash [x'=f(x)]P$, which says that formula P is an invariant of the ODE $x'=f(x)$. Rules SLyap<sub>≥</sub>, SLyap<sub>&gt;</sub> are slight generalizations of Lyapunov function proof rules for set stability [14] and they respectively generalize rules Lyap<sub>≥</sub>, Lyap<sub>&gt;</sub> to prove stability for an invariant P. Importantly, both rules assume that P characterizes a compact, i.e., closed and bounded set, which simplifies the arithmetical conditions on v in their premises. The rule without the boundedness requirement on P, suggested in the remark after [18, Definition 8.1], is unsound; see supplement [35].

For asymptotic stability (in rule SLyap<sub>&gt;</sub>), boundedness also guarantees that perturbed ODE solutions always exist for sufficient duration, which is a fundamental step in the ODE liveness proofs [36]. Rule SLyap<sup>∗</sup><sub>≥</sub> is derived from rule GLyap using invariance of P by the first premise; it provides a means of formally proving the set stability properties (3) and (4) from Example 17.

Example 20 (Stability of rigid body motion). The proof for (3) uses the Lyapunov function $v = \frac{1}{2} \left( \frac{I\_1 - I\_2}{I\_3} x\_2^2 - \frac{I\_3 - I\_1}{I\_2} x\_3^2 \right)$, whose Lie derivative is $\dot{v} = 0$, and rule SLyap<sup>∗</sup><sub>≥</sub> with formula $P \equiv x\_2 = 0 \land x\_3 = 0$. The proof for (4) is symmetric. For the top premise of rule SLyap<sup>∗</sup><sub>≥</sub>, formula P is a provable invariant [28] of the ODE $\alpha\_r$. The bottom premise, although arithmetically complicated, can be simplified by choosing γ = ε and deciding the resulting formula by $\mathbb{R}$.

Recall that the $x\_1$ axis is not a compact set so neither of the standard proof rules for set stability SLyap<sub>≥</sub>, SLyap<sub>&gt;</sub> would be sound for this proof.
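The claim that the Lie derivative of v vanishes along $\alpha\_r$ can be spot-checked with exact rational arithmetic. The following is a minimal sketch rather than a proof: the moments of inertia (3, 2, 1) are hypothetical values satisfying $I\_1 > I\_2 > I\_3 > 0$, and the function names are illustrative only.

```python
# Exact check (at sample points, not a proof) that the Lyapunov function of
# Example 20 has zero Lie derivative along the rigid-body ODE alpha_r.
from fractions import Fraction as F
import random


def alpha_r(x, I):
    """Right-hand side of alpha_r from Example 17."""
    x1, x2, x3 = x
    I1, I2, I3 = I
    return ((I2 - I3) / I1 * x2 * x3,
            (I3 - I1) / I2 * x3 * x1,
            (I1 - I2) / I3 * x1 * x2)


def v(x, I):
    """Lyapunov function from Example 20; it does not depend on x1."""
    _, x2, x3 = x
    I1, I2, I3 = I
    return F(1, 2) * ((I1 - I2) / I3 * x2**2 - (I3 - I1) / I2 * x3**2)


def lie_derivative_v(x, I):
    """Gradient of v dotted with the right-hand side of alpha_r."""
    _, x2, x3 = x
    I1, I2, I3 = I
    _, dx2, dx3 = alpha_r(x, I)
    dv_dx2 = (I1 - I2) / I3 * x2
    dv_dx3 = -(I3 - I1) / I2 * x3
    return dv_dx2 * dx2 + dv_dx3 * dx3


def check(samples=1000, seed=0):
    rng = random.Random(seed)
    I = (F(3), F(2), F(1))  # hypothetical moments with I1 > I2 > I3 > 0
    for _ in range(samples):
        x = tuple(F(rng.randint(-50, 50), rng.randint(1, 10)) for _ in range(3))
        assert lie_derivative_v(x, I) == 0
        # v is non-negative and vanishes exactly on the x1 axis (x2 = x3 = 0).
        assert v(x, I) >= 0
        assert (v(x, I) == 0) == (x[1] == 0 and x[2] == 0)
    return True
```

Since both coefficients of v are positive under the ordering of the moments of inertia, v behaves like a squared distance to the $x\_1$ axis, which is what the bottom premise of the rule exploits.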

**Epsilon-Stability** Motivated by numerical robustness of proofs of stability, Gao et al. [12] define ε-stability for ODEs. The following dL characterization shows how ε-stability can be understood as an instance of general stability.

**Lemma 21 (ε-Stability in dL).** The origin of ODE $x'=f(x)$ is *ε-stable* for constant ε > 0 iff the dL formula $\text{Stab}\_{\text{R}}^{\text{P}}(x'=f(x), x=0, \mathcal{U}\_{\varepsilon}(x=0))$ is valid.

Unlike set stability, ε-stability is an instance of general stability where the pre- and postconditions differ. In ε-stability, systems are perturbed from the precondition x = 0 (the origin), but the postcondition enlarges the set of desired states to an ε > 0 neighborhood of the origin, which is considered indistinguishable from the origin itself [12]. An immediate consequence of Lemma 21 is that rule GLyap can be used to prove ε-stability, as shown in the next section.

#### **5 Stability in KeYmaera X**

This section puts the dL stability specifications and derivations from the preceding sections into practice through proofs for several case studies in the KeYmaera X theorem prover [11].<sup>6</sup> Examples 7, 13, 17, 20 have also been formalized. The insights from these proofs are discussed after an overview of the case studies.

Inverted Pendulum. The stability of the resting state of the pendulum is investigated in Examples 7 and 13. For the inverted pendulum $\alpha\_i$ from (2), the controlled torque $u(\theta, \omega)$ must be designed and rigorously proved to ensure feedback stabilization [18] of the inverted position. A standard PD (Proportional-Derivative) controller can be used for stabilization, where the control input has the form $u(\theta, \omega) = k\_1 \theta + k\_2 \omega$ for tuning parameters $k\_1, k\_2$. Asymptotic stability of the inverted position is achieved for any control parameter choice where $k\_1 > a$ and $k\_2 > -b$. The sequent $a > 0, b \geq 0, k\_1 > a, k\_2 > -b \vdash \text{AStab}(\alpha\_i)$ is proved in KeYmaera X using the Lyapunov function $\frac{(k\_1 - a)\theta^2}{2} + \frac{((b + k\_2)\theta + \omega)^2 + \omega^2}{4}$.

<sup>6</sup> See https://github.com/LS-Lab/KeYmaeraX-projects/blob/master/stability

Frictional Tennis Racket Theorem. The stability of a 3D rigid body is investigated for $\alpha\_r$ in Examples 17 and 20. The following ODEs model additional frictional forces that oppose the rotational motion in each axis of the rigid body, where $\alpha\_1, \alpha\_2, \alpha\_3 > 0$ are positive coefficients of friction:

$$\alpha\_f \equiv x\_1' = \frac{I\_2 - I\_3}{I\_1} x\_2 x\_3 - \alpha\_1 x\_1, \; x\_2' = \frac{I\_3 - I\_1}{I\_2} x\_3 x\_1 - \alpha\_2 x\_2, \; x\_3' = \frac{I\_1 - I\_2}{I\_3} x\_1 x\_2 - \alpha\_3 x\_3$$

In the presence of friction, rotations of the rigid body are globally asymptotically stable in the first and third principal axes, as proved in KeYmaera X.

$$\begin{aligned} \Gamma & \equiv I\_1 > I\_2, I\_2 > I\_3, I\_3 > 0, \alpha\_1 > 0, \alpha\_2 > 0, \alpha\_3 > 0\\ \Gamma & \vdash \text{Stab}^{\text{P}}\_{\text{R}}(\alpha\_f, x\_2 = 0 \land x\_3 = 0, x\_2 = 0 \land x\_3 = 0) \land \text{Attr}^{\text{P}}\_{\text{R}}(\alpha\_f, true, x\_2 = 0 \land x\_3 = 0) \\ \Gamma & \vdash \text{Stab}^{\text{P}}\_{\text{R}}(\alpha\_f, x\_1 = 0 \land x\_2 = 0, x\_1 = 0 \land x\_2 = 0) \land \text{Attr}^{\text{P}}\_{\text{R}}(\alpha\_f, true, x\_1 = 0 \land x\_2 = 0) \end{aligned}$$

Both asymptotic stability properties are proved using SLyap<sup>∗</sup><sub>≥</sub> and the liveness property [36] that the kinetic energy $I\_1 x\_1^2 + I\_2 x\_2^2 + I\_3 x\_3^2$ of the system tends to zero over time. The latter property implies that solutions of $\alpha\_f$ exist globally and that the values of $x\_1, x\_2, x\_3$ asymptotically tend to zero, which proves global asymptotic stability with the aid of SetSAttr. Even though a proof rule for (global) asymptotic stability of general nonlinear ODEs and unbounded sets is not available (Section 4), this example shows that formalized stability properties can still be proved on a case-by-case basis using dL's ODE reasoning principles.
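One way to see why the kinetic energy decays is to differentiate it along $\alpha\_f$. This is a sketch of the underlying calculation, and the KeYmaera X proof may organize the argument differently. The symbol E is introduced here only as shorthand for the energy expression, and $\alpha\_{\min}$ for the smallest friction coefficient.

$$\begin{aligned} E & = I\_1 x\_1^2 + I\_2 x\_2^2 + I\_3 x\_3^2 \\ \dot{E} & = 2 I\_1 x\_1 x\_1' + 2 I\_2 x\_2 x\_2' + 2 I\_3 x\_3 x\_3' \\ & = 2 \left( (I\_2 - I\_3) + (I\_3 - I\_1) + (I\_1 - I\_2) \right) x\_1 x\_2 x\_3 - 2 \left( \alpha\_1 I\_1 x\_1^2 + \alpha\_2 I\_2 x\_2^2 + \alpha\_3 I\_3 x\_3^2 \right) \\ & = -2 \left( \alpha\_1 I\_1 x\_1^2 + \alpha\_2 I\_2 x\_2^2 + \alpha\_3 I\_3 x\_3^2 \right) \le -2 \alpha\_{\min} E \end{aligned}$$

The cubic terms cancel because the three inertia differences sum to zero, so friction alone determines the sign of $\dot{E}$.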

Moore-Greitzer Jet Engine [12]. The origin of the ODE modeling a simplified jet engine $\alpha\_m \equiv x\_1' = -x\_2 - \frac{3}{2} x\_1^2 - \frac{1}{2} x\_1^3, \; x\_2' = 3 x\_1 - x\_2$ is ε-stable for $\varepsilon = 10^{-10}$ [12]. The sequent $\varepsilon = 10^{-10} \vdash \text{Stab}\_{\text{R}}^{\text{P}}(\alpha\_m, x\_1^2 + x\_2^2 = 0, x\_1^2 + x\_2^2 < \varepsilon^2)$ is proved in KeYmaera X. The key proof ingredients are an ε-Lyapunov function [12] and manual arithmetic steps, e.g., instantiating existential quantifiers appearing in the specification of ε-stability with appropriate values [12].
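For intuition only, the behavior of $\alpha\_m$ near the origin can be explored by numerical simulation. The sketch below uses a classical fourth-order Runge-Kutta integrator; the initial state, step size, and horizon are hypothetical choices, and a simulation of this kind gives no formal guarantee, in contrast to the KeYmaera X proof.

```python
# Illustrative simulation of the Moore-Greitzer ODE alpha_m (no formal guarantee).

def moore_greitzer(x1, x2):
    """Right-hand side of alpha_m."""
    return (-x2 - 1.5 * x1**2 - 0.5 * x1**3, 3.0 * x1 - x2)


def rk4_step(f, x, h):
    """One classical Runge-Kutta step of size h for the ODE x' = f(x)."""
    def shift(p, k, s):
        return tuple(pi + s * ki for pi, ki in zip(p, k))
    k1 = f(*x)
    k2 = f(*shift(x, k1, h / 2))
    k3 = f(*shift(x, k2, h / 2))
    k4 = f(*shift(x, k3, h))
    return tuple(xi + h / 6 * (a + 2 * b + 2 * c + d)
                 for xi, a, b, c, d in zip(x, k1, k2, k3, k4))


def simulate(x0, h=0.01, steps=4000):
    """Return the squared norms x1^2 + x2^2 along the simulated trajectory."""
    x = x0
    norms = [x[0]**2 + x[1]**2]
    for _ in range(steps):
        x = rk4_step(moore_greitzer, x, h)
        norms.append(x[0]**2 + x[1]**2)
    return norms


norms = simulate((0.1, 0.1))
print(norms[0], norms[-1])
```

Comparing the first and last entries of `norms` shows whether the sampled trajectory has moved toward the origin over the chosen horizon.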

Other Examples [1]. Stability for several ODEs with Lyapunov functions generated by an inductive synthesis technique [1, Examples 5–11] was successfully verified in KeYmaera X. The proof for the largest, 6-dimensional nonlinear ODE [1, Example 5] required substantial manual arithmetic reasoning in KeYmaera X.<sup>7</sup>

The arithmetical conditions in [1, Equation 1] are identical to the premises of rule Lyap<sub>≥</sub> except that [1, Equation 1] unsoundly omits the condition v(0) = 0; see supplement [35]. The generated Lyapunov functions remain correct because the inductive synthesis technique [1] implicitly guarantees this omitted condition.

Summary. These case studies demonstrate the feasibility of carrying out proofs of various (advanced) stability properties within KeYmaera X using this paper's stability specifications. The proofs share similar high-level proof structure, which suggests that proof automation could significantly reduce proof effort [10]. Such automation should also support user input of key insights for difficult reasoning steps, e.g., real arithmetic reasoning with nested, alternating quantifiers.

<sup>7</sup> The Lyapunov function in [1, Example 5] does not work for its associated ODE. It works if the ODE is corrected with $\dot{x}\_1 = -x\_1^3 + 4 x\_2^3 - 6 x\_3 x\_4$, as in the literature [23].

#### **6 Related Work**

Stability is a fundamental property of interest across many different fields of mathematics [6,15,19,30,31,34] and engineering [14,18,20]. This related work discussion focuses on formal approaches to stability of ODEs.

Logical specification of stability. Rouche, Habets, and Laloy [31] provide a pioneering example of using logical notation to specify and classify stability properties of ODEs. Alternative logical frameworks have also been used to specify stability and related properties: stability is expressed in HyperSTL [22] as a hyperproperty relating the trace of an ODE against two constant traces; ε-stability is studied in the context of δ-complete reasoning over the reals [12]; region stability for hybrid systems [29] is discussed using CTL\*; the syntactic specification of $\text{Asym}(x'=f(x), P)$ resembles the limit definition using filters [16]. This paper uses dL as a sweet spot logical framework, general enough to specify various stability properties of interest, e.g., asymptotic or exponential stability, and the stability of sets, while also enabling syntactic proofs of those properties.

Formal verification of stability. There is a vast literature on finding Lyapunov functions for stability, e.g., through numerical [24,23,37] and algebraic methods [9,21]. Formal approaches are often based on finding Lyapunov function candidates and certifying the correctness of those generated candidates [1,12,17,33]. This paper's approach enables highly trustworthy certification of those candidates in dL and KeYmaera X, with stability proof rules that are soundly derived from dL's parsimonious axiomatization [25,26,27], as implemented in KeYmaera X [11,26]. Sections 4 and 5 further show that this paper's approach supports verification of advanced stability properties [12,14,18] within the same dL framework. New stability proof rules like GLyap can also be soundly and syntactically justified in dL without the need for (low-level) semantic reasoning about the underlying ODE mathematics. As an example of the latter, semantic approach, LaSalle's invariance principle is formalized in Coq [7] and used to verify the correctness of an inverted pendulum controller [32].

#### **7 Conclusion**

This paper shows how ODE stability can be formalized in dL using the key idea that stability properties are ∀/∃-quantified dynamical formulas. These specifications, their proof rules, and their logical relationships are all syntactically derived from dL's sound proof calculus. This further enables trustworthy KeYmaera X proofs that rigorously verify every step in an ODE stability argument, from arithmetical premises down to dynamical reasoning for ODEs. Directions for future work include i) formalization of stability with respect to perturbations of the system dynamics, and ii) generalizations of stability to hybrid systems.

**Acknowledgments.** We thank Brandon Bohrer, Stefan Mitsch, and the anonymous reviewers for their helpful feedback on KeYmaera X and this paper.

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Tool Papers**

### **An SMT-Based Approach for Verifying Binarized Neural Networks**

Guy Amir<sup>1</sup>, Haoze Wu<sup>2</sup>, Clark Barrett<sup>2</sup>, and Guy Katz<sup>1</sup>

<sup>1</sup> The Hebrew University of Jerusalem, Jerusalem, Israel {guy.amir2, g.katz}@mail.huji.ac.il <sup>2</sup> Stanford University, Stanford, USA {haozewu, barrett}@cs.stanford.edu

**Abstract.** Deep learning has emerged as an effective approach for creating modern software systems, with neural networks often surpassing hand-crafted systems. Unfortunately, neural networks are known to suffer from various safety and security issues. Formal verification is a promising avenue for tackling this difficulty, by formally certifying that networks are correct. We propose an SMT-based technique for verifying binarized neural networks — a popular kind of neural network, where some weights have been binarized in order to render the neural network more memory and energy efficient, and quicker to evaluate. One novelty of our technique is that it allows the verification of neural networks that include both binarized and non-binarized components. Neural network verification is computationally very difficult, and so we propose here various optimizations, integrated into our SMT procedure as deduction steps, as well as an approach for parallelizing verification queries. We implement our technique as an extension to the Marabou framework, and use it to evaluate the approach on popular binarized neural network architectures.

#### **1 Introduction**

In recent years, deep neural networks (DNNs) [21] have revolutionized the state of the art in a variety of tasks, such as image recognition [12, 37], text classification [39], and many others. These DNNs, which are artifacts that are generated automatically from a set of training data, generalize very well — i.e., are very successful at handling inputs they had not encountered previously. The success of DNNs is so significant that they are increasingly being incorporated into highly-critical systems, such as autonomous vehicles and aircraft [7, 30].

In order to tackle increasingly complex tasks, the size of modern DNNs has also been increasing, sometimes reaching many millions of neurons [46]. Consequently, in some domains, DNN size has become a restricting factor: huge networks have a large memory footprint, and evaluating them consumes both time and energy. Thus, resource-efficient networks are required in order to allow DNNs to be deployed on resource-limited, embedded devices [23, 42].

One promising approach for mitigating this problem is via DNN quantization [4, 27]. Ordinarily, each edge in a DNN has an associated weight, typically stored as a 32-bit floating point number. In a quantized network, these weights are stored using fewer bits. Additionally, the activation functions used by the network are also quantized, so that their outputs consist of fewer bits. The network's memory footprint thus becomes significantly smaller, and its evaluation much quicker and cheaper. When the weights and activation function outputs are represented using just a single bit, the resulting network is called a binarized neural network (BNN) [26]. BNNs are a highly popular variant of a quantized DNN [10, 40, 56, 57], as their evaluation can be up to 58 times faster, and their memory footprint 32 times smaller, than that of traditional DNNs [45]. There are also network architectures in which some parts of the network are quantized, and others are not [45]. While quantization leads to some loss of network precision, quantized networks are sufficiently precise in many cases [45].

© The Author(s) 2021
J. F. Groote and K. G. Larsen (Eds.): TACAS 2021, LNCS 12652, pp. 203–222, 2021. https://doi.org/10.1007/978-3-030-72013-1\_11

In recent years, various security and safety issues have been observed in DNNs [33, 48]. This has led to the development of a large variety of verification tools and approaches (e.g., [16, 25, 33, 52], and many others). However, most of these approaches have not focused on binarized neural networks, although they are just as vulnerable to safety and security concerns as other DNNs. Recent work has shown that verifying quantized neural networks is PSPACE-hard [24], and that it requires different methods than the ones used for verifying non-quantized DNNs [18]. The few existing approaches that do handle binarized networks focus on the strictly binarized case, i.e., on networks where all components are binary, and verify them using a SAT solver encoding [29, 43]. Neural networks that are only partially binarized [45] cannot be readily encoded as SAT formulas, and thus verifying these networks remains an open problem.

Here, we propose an SMT-based [5] approach and tool for the formal verification of binarized neural networks. We build on top of the Reluplex algorithm [33],<sup>3</sup> and extend it so that it can support the sign function,

$$\text{sign}(x) = \begin{cases} -1 & x < 0 \\ 1 & x \ge 0. \end{cases}$$

We show how this extension, when integrated into Reluplex, is sufficient for verifying BNNs. To the best of our knowledge, the approach presented here is the first capable of verifying BNNs that are not strictly binarized. Our technique is implemented as an extension to the open-source Marabou framework [2, 34]. We discuss the principles of our approach and the key components of our implementation. We evaluate it both on the XNOR-Net BNN architecture [45], which combines binarized and non-binarized parts, and on a strictly binarized network.

The rest of this paper is organized as follows. In Section 2, we provide the necessary background on DNNs, BNNs, and the SMT-based formal verification of DNNs. Next, we present our SMT-based approach for supporting the sign activation function in Section 3, followed by details on enhancements and optimizations for the approach in Section 4. We discuss the implementation of our tool in Section 5, and its evaluation in Section 6. Related work is discussed in Section 7, and we conclude in Section 8.

<sup>3</sup> [33] is a recent extended version of the original Reluplex paper [31].

#### **2 Background**

**Deep Neural Networks.** A deep neural network (DNN) is a directed graph, where the nodes (also called neurons) are organized in layers. The first layer is the input layer, the last layer is the output layer, and the intermediate layers are the hidden layers. When the network is evaluated, the input neurons are assigned initial values (e.g., the pixels of an image), and these values are then propagated through the network, layer by layer, all the way to the output layer. The values of the output neurons determine the result returned to the user: often, the neuron with the greatest value corresponds to the output class that is returned. A network is called feed-forward if outgoing edges from neurons in layer i can only lead to neurons in layer j if j>i. For simplicity, we will assume here that outgoing edges from layer i only lead to the consecutive layer, i + 1.

Each layer in the neural network has a layer type, which determines how the values of its neurons are computed (using the values of the preceding layer's neurons). One common type is the weighted sum layer: neurons in this layer are computed as a linear combination of the values of neurons from the preceding layer, according to predetermined edge weights and biases. Another common type of layer is the rectified linear unit (ReLU) layer, where each node y is connected to precisely one node x from the preceding layer, and its value is computed by y = ReLU(x) = max(0, x). The max-pooling layer is also common: each neuron y in this layer is connected to multiple neurons $x\_1, \ldots, x\_k$ from the preceding layer, and its value is given by $y = \max(x\_1, \ldots, x\_k)$.

More formally, a DNN $N$ with $k$ inputs and $m$ outputs is a mapping $\mathbb{R}^k \to \mathbb{R}^m$. It is given as a sequence of layers $L\_1, \ldots, L\_n$, where $L\_1$ and $L\_n$ are the input and output layers, respectively. We denote the size of layer $L\_i$ as $s\_i$, and its individual neurons as $v\_i^1, \ldots, v\_i^{s\_i}$. We use $V\_i$ to denote the column vector $[v\_i^1, \ldots, v\_i^{s\_i}]^T$. During evaluation, the input values $V\_1$ are given, and $V\_2, \ldots, V\_n$ are computed iteratively. The network also includes a mapping $T\_N : \mathbb{N} \to \mathcal{T}$, such that $T\_N(i)$ indicates the type of hidden layer $i$. For our purposes, we focus on layer types $\mathcal{T} = \{\text{weighted sum}, \text{ReLU}, \text{max}\}$, but of course other types could be included. If $T\_N(i) = \text{weighted sum}$, then layer $L\_i$ has a weight matrix $W\_i$ of dimensions $s\_i \times s\_{i-1}$ and a bias vector $B\_i$ of size $s\_i$, and its values are computed as $V\_i = W\_i \cdot V\_{i-1} + B\_i$. For $T\_N(i) = \text{ReLU}$, the ReLU function is applied to each neuron, i.e., $v\_i^j = \text{ReLU}(v\_{i-1}^j)$ (we require that $s\_i = s\_{i-1}$ in this case). If $T\_N(i) = \text{max}$, then each neuron $v\_i^j$ in layer $L\_i$ has a list src of source indices, and its value is computed as $v\_i^j = \max\_{k \in \text{src}} v\_{i-1}^k$.

A simple illustration appears in Fig. 1. This network has a weighted sum layer and a ReLU layer as its hidden layers, and a weighted sum layer as its output layer. For the weighted sum layers, the weights and biases are listed in the figure. On input $V_1 = [1, 2]^T$, the first layer's neurons evaluate to $V_2 = [6, -1]^T$. After ReLUs are applied, we get $V_3 = [6, 0]^T$, and finally the output is $V_4 = [6]$.

Fig. 1: A toy DNN.

**Binarized Neural Networks.** In a binarized neural network (BNN), the layers are typically organized into binary blocks, regarded as units with binary inputs and outputs. Following the definitions of Hubara et al. [26] and Narodytska et al. [43], a binary block is composed of three layers: (i) a weighted sum layer, where each entry of the weight matrix $W$ is either $1$ or $-1$; (ii) a batch normalization layer, which normalizes the values from its preceding layer (this layer can be regarded as a weighted sum layer, where the weight matrix $W$ has real-valued entries in its diagonal, and $0$ for all other entries); and (iii) a sign layer, which applies the sign function to each neuron in the preceding layer. Because each block ends with a sign layer, its output is always a binary vector, i.e. a vector whose entries are $\pm 1$. Thus, when several binary blocks are concatenated, the inputs and outputs of each block are always binary. Here, we call a network strictly binarized if it is composed solely of binary blocks (except for the output layer). If the network contains binary blocks but also additional layers (e.g., ReLU layers), we say that it is a partially binarized neural network. BNNs can be made to fit into our definitions by extending the set $\mathcal{T}$ to include the sign function. An example appears in Fig. 2; for input $V_1 = [-1, 3]^T$, the network's output is $V_5 = [-2]$.

Fig. 2: A toy BNN with a single binary block composed of three layers: a weighted sum layer, a batch normalization layer, and a sign layer.

**SMT-Based Verification of Deep Neural Networks.** Given a DNN $N$ that transforms an input vector $x$ into an output vector $y = N(x)$, a pre-condition $P$ on $x$, and a post-condition $Q$ on $y$, the DNN verification problem [33] is to determine whether there exists a concrete input $x_0$ such that $P(x_0) \wedge Q(N(x_0))$. Typically, $Q$ represents an undesirable output of the DNN, and so the existence of such an $x_0$ constitutes a counterexample. A sound and complete verification engine should return a suitable $x_0$ if the problem is satisfiable (SAT), or reply that it is unsatisfiable (UNSAT). As in most DNN verification literature, we will restrict ourselves to the case where $P$ and $Q$ are conjunctions of linear constraints over the input and output neurons, respectively [16, 33, 52].

Here, we focus on an SMT-based approach for DNN verification, which was introduced in the Reluplex algorithm [33] and extended in the Marabou framework [2, 34]. It entails regarding the DNN's node values as variables, and the verification query as a set of constraints on these variables. The solver's goal is to find an assignment of the DNN's nodes that satisfies $P$ and $Q$. The constraints are partitioned into two sets: linear constraints, i.e. equations and variable lower and upper bounds, which include the input constraints in $P$, the output constraints in $Q$, and the weighted sum layers within the network; and piecewise-linear constraints, which include the activation function constraints, such as ReLU or max constraints. The linear constraints are easier to solve (specifically, they can be phrased as a linear program [6], solvable in polynomial time); whereas the piecewise-linear constraints are more difficult, and render the problem NP-complete [33]. We observe that sign constraints are also piecewise-linear.

In Reluplex, the linear constraints are solved iteratively, using a variant of the Simplex algorithm [13]. Specifically, Reluplex maintains a variable assignment, and iteratively corrects the assignments of variables that violate a linear constraint. Once the linear constraints are satisfied, Reluplex attempts to correct any violated piecewise-linear constraints — again by making iterative adjustments to the assignment. If these steps re-introduce violations in the linear constraints, these constraints are addressed again. Often, this process converges; but if it does not, Reluplex performs a case split, which transforms one piecewise-linear constraint into a disjunction of linear constraints. Then, one of the disjuncts is applied and the others are stored, and the solving process continues; and if UNSAT is reached, Reluplex backtracks, removes the disjunct it has applied and applies a different disjunct instead. The process terminates either when one of the search paths returns SAT (the entire query is SAT), or when they all return UNSAT (the entire query is UNSAT). It is desirable to perform as few case splits as possible, as they significantly enlarge the search space to be explored.

The Reluplex algorithm is formally defined as a sound and complete calculus of derivation rules [33]. We omit here the derivation rules aimed at solving the linear constraints, and present only the rules aimed at addressing the piecewise-linear constraints; specifically, ReLU constraints [33]. These derivation rules are given in Fig. 3, where: (i) $\mathcal{X}$ is the set of all variables in the query; (ii) $R$ is the set of all ReLU pairs; i.e., $\langle b, f \rangle \in R$ implies that it should hold that $f = \text{ReLU}(b)$; (iii) $\alpha$ is the current assignment, mapping variables to real values; (iv) $l$ and $u$ map variables to their current lower and upper bounds, respectively; and (v) the $\text{update}(\alpha, x, v)$ procedure changes the current assignment $\alpha$ by setting the value of $x$ to $v$. The $\text{ReluCorrect}_b$ and $\text{ReluCorrect}_f$ rules are used for correcting an assignment in which a ReLU constraint is currently violated, by adjusting either the value of $b$ or $f$, respectively. The ReluSplit rule transforms a ReLU constraint into a disjunction, by forcing either $b$'s lower bound to be non-negative, or its upper bound to be non-positive. This forces the constraint into either its active phase (the identity function) or its inactive phase (the zero function). In the case when we guess that a ReLU is active, we also apply the addEq operation to add the equation $f = b$, in order to make sure the ReLU is satisfied in the active phase. The Success rule terminates the search procedure when all variable assignments are within their bounds (i.e., all linear constraints hold), and all ReLU constraints are satisfied. The rule for reaching an UNSAT conclusion is part of the linear constraint derivation rules which are not depicted; see [33] for additional details.

The aforementioned derivation rules describe a search procedure: the solver incrementally constructs a satisfying assignment, and performs case splitting when needed. Another key ingredient in modern SMT solvers is deduction steps, aimed at narrowing down the search space by ruling out possible case splits. In this context, deductions are aimed at obtaining tighter bounds for variables: i.e., finding greater values for $l(x)$ and smaller values for $u(x)$ for each variable $x \in \mathcal{X}$. These bounds can indeed remove case splits by fixing activation functions into one of their phases; for example, if $f = \text{ReLU}(b)$ and we deduce that $b \geq 3$, we know that the ReLU is in its active phase, and no case split is required. We provide additional details on some of these deduction steps in Section 4.

$$\begin{array}{ll} \text{ReluCorrect}_{b} & \dfrac{\langle b,f\rangle \in R, \quad \alpha(f) \neq \text{ReLU}(\alpha(b))}{\alpha := \text{update}(\alpha,b,\alpha(f))} \\[2ex] \text{ReluCorrect}_{f} & \dfrac{\langle b,f\rangle \in R, \quad \alpha(f) \neq \text{ReLU}(\alpha(b))}{\alpha := \text{update}(\alpha,f,\text{ReLU}(\alpha(b)))} \\[2ex] \text{ReluSplit} & \dfrac{\langle b,f\rangle \in R}{u(b) := \min(u(b),0),\ u(f) := \min(u(f),0) \qquad\quad l(b) := \max(l(b),0),\ \text{addEq}(f = b)} \\[2ex] \text{Success} & \dfrac{\forall x \in \mathcal{X}.\ l(x) \leq \alpha(x) \leq u(x), \quad \forall \langle b,f\rangle \in R.\ \alpha(f) = \text{ReLU}(\alpha(b))}{\text{SAT}} \end{array}$$

Fig. 3: Derivation rules for the Reluplex algorithm (simplified; see [33] for more details).

#### **3 Extending Reluplex to Support Sign Constraints**

In order to extend Reluplex to support sign constraints, we follow a similar approach to how ReLUs are handled. We encode every sign constraint $f = \text{sign}(b)$ as two separate variables, $f$ and $b$. Variable $b$ represents the input to the sign function, whereas $f$ represents the sign's output. In the toy example from Fig. 2, $b$ will represent the assignment for neuron $v_3^1$, and $f$ will represent $v_4^1$.

Initially, a sign constraint poses no bound constraints over b, i.e. l(b) = −∞ and u(b) = ∞. Because the values of f are always ±1, we set l(f) = −1 and u(f) = 1. If, during the search and deduction process, tighter bounds are discovered that imply that b ≥ 0 or f > −1, we say that the sign constraint has been fixed to the positive phase; in this case, it can be regarded as a linear constraint, namely b ≥ 0∧f = 1. Likewise, if it is discovered that b < 0 or f < 1, the constraint is fixed to the negative phase, and is regarded as b < 0 ∧ f = −1. If neither case applies, we say that the constraint's phase has not yet been fixed.
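The phase-fixing conditions just described can be sketched as a small helper. This is a minimal sketch, assuming bounds are passed as plain floats; the function name and the returned labels are hypothetical and do not correspond to any tool's API.

```python
import math
from typing import Optional


def sign_phase(l_b: float, u_b: float, l_f: float = -1.0, u_f: float = 1.0) -> Optional[str]:
    """Return the fixed phase of f = sign(b), or None if it is not yet fixed.

    Positive phase: b >= 0 is implied (l(b) >= 0) or f > -1 is implied (l(f) > -1).
    Negative phase: b < 0 is implied (u(b) < 0) or f < 1 is implied (u(f) < 1).
    Contradictory bounds (both phases implied) are not detected in this sketch.
    """
    if l_b >= 0 or l_f > -1:
        return "positive"  # constraint becomes: b >= 0 and f = 1
    if u_b < 0 or u_f < 1:
        return "negative"  # constraint becomes: b < 0 and f = -1
    return None


# Initial bounds: l(b) = -inf, u(b) = +inf, l(f) = -1, u(f) = 1.
print(sign_phase(-math.inf, math.inf))  # None
```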

In each iteration of the search procedure, a violated constraint is selected and corrected, by altering the variable assignment. A violated sign constraint is corrected by assigning f the appropriate value: −1 if the current assignment of b is negative, and 1 otherwise. Case splits (which are needed to ensure completeness and termination) are handled similarly to the ReLU case: we allow the solver to assert that a sign constraint is in either the positive or negative phase, and then backtrack and flip that assertion if the search hits a dead-end.

More formally, we define this extension to Reluplex by modifying the derivation rules described in Fig. 3 as follows. The rules for handling linear constraints and ReLU constraints are unchanged; the approach is modular and extensible in that sense, as each type of constraint is addressed separately. In Fig. 4, we depict new derivation rules, capable of addressing sign constraints. The $\text{SignCorrect}^-$ and $\text{SignCorrect}^+$ rules allow us to adjust the assignment of $f$ to account for the current assignment of $b$, i.e., set $f$ to $-1$ if $b$ is negative, and to $1$ otherwise. The SignSplit rule is used for performing a case split on a sign constraint, introducing a disjunction for enforcing that either $b$ is non-negative ($l(b) \geq 0$) and $f = 1$, or $b$ is negative ($u(b) \leq -\epsilon$; $\epsilon$ is a small positive constant, chosen to reflect the desired precision) and $f = -1$. Finally, the Success rule replaces the one from Fig. 3: it requires that all linear, ReLU and sign constraints be satisfied simultaneously.

Fig. 4: The extended Reluplex derivation rules, with support for sign constraints.
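As a rough illustration of the prose description of these rules, the sketch below models the correction and split steps on dictionaries that map variable names to values or bounds. It is an assumption-laden sketch rather than a transcription of the rules in Fig. 4: the data layout, the function names and the default value of `EPSILON` are all hypothetical.

```python
from typing import Dict, Tuple

Bounds = Dict[str, float]
EPSILON = 1e-6  # hypothetical precision constant for the negative phase


def sign(x: float) -> float:
    return 1.0 if x >= 0 else -1.0


def is_violated(alpha: Dict[str, float], b: str, f: str) -> bool:
    """A sign constraint is violated when alpha(f) != sign(alpha(b))."""
    return alpha[f] != sign(alpha[b])


def sign_correct(alpha: Dict[str, float], b: str, f: str) -> Dict[str, float]:
    """Covers both correction rules: set f to -1 if b is negative, else to 1."""
    updated = dict(alpha)
    updated[f] = sign(alpha[b])
    return updated


def sign_split(lower: Bounds, upper: Bounds, b: str, f: str,
               eps: float = EPSILON) -> Tuple[Tuple[Bounds, Bounds], Tuple[Bounds, Bounds]]:
    """Return the (lower, upper) bound maps of the negative and positive branches."""
    neg_l, neg_u = dict(lower), dict(upper)
    neg_u[b] = min(upper[b], -eps)   # b is negative: u(b) <= -eps
    neg_l[f] = neg_u[f] = -1.0       # f = -1
    pos_l, pos_u = dict(lower), dict(upper)
    pos_l[b] = max(lower[b], 0.0)    # b is non-negative: l(b) >= 0
    pos_l[f] = pos_u[f] = 1.0        # f = 1
    return (neg_l, neg_u), (pos_l, pos_u)


alpha = {"b": 1.0, "f": -1.0}
if is_violated(alpha, "b", "f"):
    alpha = sign_correct(alpha, "b", "f")
print(alpha)  # {'b': 1.0, 'f': 1.0}
```

A solver would explore one branch returned by `sign_split` and keep the other for backtracking.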

We demonstrate this process with a simple example. Observe again the toy example from Fig. 2, the pre-condition $P = (1 \leq v_1^1 \leq 2) \wedge (-1 \leq v_1^2 \leq 1)$, and the post-condition $Q = (v_5^1 \leq 5)$. Our goal is to find an assignment to the variables $\{v_1^1, v_1^2, v_2^1, v_3^1, v_4^1, v_5^1\}$ that satisfies $P$, $Q$, and also the constraints imposed by the BNN itself, namely the weighted sums $v_2^1 = v_1^1 - v_1^2 + 1$, $v_3^1 = 0.5 v_2^1$, and $v_5^1 = 2 v_4^1$, and the sign constraint $v_4^1 = \text{sign}(v_3^1)$.
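To make the query concrete, the following sketch encodes the toy BNN's equations and the conditions $P$ and $Q$ exactly as stated above, and checks whether a given input is a satisfying assignment. The function names are our own; this is a direct evaluator for illustration, not the search procedure itself.

```python
def sign(x: float) -> int:
    return 1 if x >= 0 else -1


def evaluate_toy_bnn(v11: float, v12: float) -> int:
    """Forward pass through the toy BNN; returns the output neuron v_5^1."""
    v21 = v11 - v12 + 1   # weighted sum layer
    v31 = 0.5 * v21       # batch normalization, encoded as a weighted sum
    v41 = sign(v31)       # sign layer
    return 2 * v41        # output weighted sum layer


def satisfies_query(v11: float, v12: float) -> bool:
    """Check P on the inputs and Q on the resulting output."""
    p_holds = 1 <= v11 <= 2 and -1 <= v12 <= 1
    q_holds = evaluate_toy_bnn(v11, v12) <= 5
    return p_holds and q_holds


print(evaluate_toy_bnn(-1, 3))  # -2, the output quoted for input [-1, 3]
print(satisfies_query(1, 0))    # True
```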

Initially, we invoke derivation rules that address the linear constraints (see [33]), and come up with an assignment that satisfies them, depicted as assignment 1 in Fig. 5. However, this assignment violates the sign constraint: $v_4^1 = -1 \neq \text{sign}(v_3^1) = \text{sign}(1) = 1$. We can thus invoke the $\text{SignCorrect}^+$ rule, which adjusts the assignment, leading to assignment 2 in the figure. The sign constraint is now satisfied, but the linear constraint $v_5^1 = 2 v_4^1$ is violated. We thus let the solver correct the linear constraints again, this time obtaining assignment 3 in the figure, which satisfies all constraints. The Success rule now applies, and we return SAT and the satisfying variable assignment.

Fig. 5: An iterative solution for a BNN verification query.

The above-described calculus is sound and complete (assuming the $\epsilon$ used in the SignSplit rule is sufficiently small): when it answers SAT or UNSAT, that statement is correct, and for any input query there is a sequence of derivation steps that will lead to either SAT or UNSAT. The proof is quite similar to that of the original Reluplex procedure [33], and is omitted. A naive strategy that will always lead to termination is to apply the SignSplit rule to saturation; this effectively transforms the problem into an (exponentially long) sequence of linear programs. Then, each of these linear programs can be solved quickly (linear programming is known to be in P). However, this strategy is typically quite slow. In the next section we discuss how many of these case splits can be avoided by applying multiple optimizations.

#### **4 Optimizations**

**Weighted Sum Layer Elimination.** The SMT-based approach introduces a new variable for each node in a weighted sum layer, and an equation to express that node's value as a weighted sum of nodes from the preceding layer. In BNNs, we often encounter consecutive weighted sum layers, specifically because of the binary block structure, in which a weighted sum layer is followed by a batch normalization layer, which is also encoded as a weighted sum layer. Thus, a straightforward way to reduce the number of variables and equations, and hence to expedite the solution process, is to combine two consecutive weighted sum layers into a single layer. Specifically, the original layers can be regarded as transforming input $x$ into $y = W_2(W_1 \cdot x + B_1) + B_2$, and the simplification as computing $y = W_3 \cdot x + B_3$, where $W_3 = W_2 \cdot W_1$ and $B_3 = W_2 \cdot B_1 + B_2$. An illustration appears in Fig. 6 (for simplicity, all bias values are assumed to be 0).
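The merge formula can be checked numerically with a short sketch. The matrices below are hypothetical (a $\pm 1$ weight matrix followed by a diagonal matrix standing in for batch normalization) and the helper names are our own; the code only illustrates that $W_3 = W_2 \cdot W_1$ and $B_3 = W_2 \cdot B_1 + B_2$ reproduce the two-layer computation.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(A: Matrix, v: Vector) -> Vector:
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def matmul(A: Matrix, B: Matrix) -> Matrix:
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def affine(W: Matrix, B: Vector, x: Vector) -> Vector:
    """One weighted sum layer: W * x + B."""
    return [y + b for y, b in zip(matvec(W, x), B)]


def merge_weighted_sums(W1: Matrix, B1: Vector, W2: Matrix, B2: Vector) -> Tuple[Matrix, Vector]:
    """Merge y = W2 (W1 x + B1) + B2 into y = W3 x + B3."""
    W3 = matmul(W2, W1)
    B3 = [y + b for y, b in zip(matvec(W2, B1), B2)]
    return W3, B3


# Hypothetical binary block prefix: +/-1 weights, then a diagonal "batch norm".
W1, B1 = [[1.0, -1.0], [-1.0, -1.0]], [1.0, 0.0]
W2, B2 = [[0.5, 0.0], [0.0, 2.0]], [0.0, -1.0]
W3, B3 = merge_weighted_sums(W1, B1, W2, B2)

x = [1.0, 2.0]
print(affine(W2, B2, affine(W1, B1, x)))  # two layers
print(affine(W3, B3, x))                  # merged layer, same values
```

The merged layer needs one variable and one equation per output neuron instead of two, which is the saving the optimization targets.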

Fig. 6: On the left, a (partial) DNN with two consecutive weighted sum layers. On the right, an equivalent DNN with these two layers merged into one.

**LP Relaxation.** Given a constraint $f = \text{sign}(b)$, it is beneficial to deduce tighter bounds on the $b$ and $f$ variables, especially if these tighter bounds fix the constraint into one of its linear phases. We thus introduce a preprocessing phase, prior to the invocation of our enhanced Reluplex procedure, in which tighter bounds are computed by invoking a linear programming (LP) solver.

The idea, inspired by similar relaxations for ReLU nodes [14, 49], is to overapproximate each constraint in the network, including sign constraints, as a set of linear constraints. Then, for every variable v in the encoding, an LP solver is used to compute an upper bound u (by maximizing) and a lower bound l (by minimizing) for v. Because the LP encoding is an over-approximation, v is indeed within the range [l, u] for any input to the network.

Let $f = \text{sign}(b)$, and suppose we initially know that $l \leq b \leq u$, with $l < 0 < u$ (otherwise the phase of the constraint is already fixed). The linear over-approximation that we introduce for $f$ is a trapezoid (see Fig. 7), with the following edges: (i) $f \leq 1$; (ii) $f \geq -1$; (iii) $f \leq \frac{2}{-l} \cdot b + 1$; and (iv) $f \geq \frac{2}{u} \cdot b - 1$. It is straightforward to show that these four inequalities form the smallest convex polytope containing the values of $f$.

We demonstrate this process on the simple BNN depicted on the left-hand side of Fig. 7. Suppose we know that the input variable, $x$, is bounded in the range $-1 \leq x \leq 1$, and we wish to compute a lower bound for $y$. Simple, interval-arithmetic-based bound propagation [33] shows that $b_1 = 3x + 1$ is bounded in the range $-2 \leq b_1 \leq 4$, and similarly that $b_2 = -4x + 2$ is in the range $-2 \leq b_2 \leq 6$. Because neither $b_1$ nor $b_2$ is strictly negative or positive, we only know that $-1 \leq f_1, f_2 \leq 1$, and so the best bound obtainable for $y$ is $y \geq -2$. However, by formulating the LP relaxation of the problem (right-hand side of Fig. 7), we get the optimal solution $x = -\frac{1}{3}$, $b_1 = 0$, $b_2 = \frac{10}{3}$, $f_1 = -1$, $f_2 = \frac{1}{9}$, $y = -\frac{8}{9}$, implying the tighter bound $y \geq -\frac{8}{9}$.
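The optimal value quoted above can be reproduced without an LP solver, under one assumption that the text does not spell out: that the output is $y = f_1 + f_2$ (this is consistent with the quoted solution and with the interval bound $y \geq -2$). For a fixed $x$, the smallest value the relaxation allows for each $f_i$ is the maximum of its two lower edges, so the minimum of $y$ becomes a convex piecewise-linear function of $x$, and it suffices to inspect the interval endpoints and the breakpoints $b_1 = 0$ and $b_2 = 0$. The sketch uses exact rational arithmetic; all names are hypothetical.

```python
from fractions import Fraction as F


def trapezoid_lower(b: F, u: F) -> F:
    """Smallest f the relaxation allows at b: max of edges f >= -1 and f >= (2/u) b - 1."""
    return max(F(-1), F(2) / u * b - 1)


def relaxed_min_y(x: F) -> F:
    """Minimum of y = f1 + f2 (assumed output) over the relaxation, for a fixed x."""
    b1 = 3 * x + 1    # bounded in [-2, 4]
    b2 = -4 * x + 2   # bounded in [-2, 6]
    return trapezoid_lower(b1, F(4)) + trapezoid_lower(b2, F(6))


# Endpoints of [-1, 1] plus the breakpoints where b1 = 0 and b2 = 0.
candidates = [F(-1), F(-1, 3), F(1, 2), F(1)]
best_x = min(candidates, key=relaxed_min_y)
print(best_x, relaxed_min_y(best_x))  # -1/3 -8/9
```

For larger networks this shortcut does not apply, and a general LP solver is needed, as the text describes.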

Fig. 7: A simple BNN (left), the trapezoid relaxation of $f_1 = \text{sign}(b_1)$ (center), and its LP encoding (right). The trapezoid relaxation of $f_2$ is not depicted.

The aforementioned linear relaxation technique is effective but expensive — because it entails invoking the LP solver twice for each neuron in the BNN encoding. Consequently, in our tool, the technique is applied only once per query, as a preprocessing step. Later, during the search procedure, we apply a related but more lightweight technique, called symbolic bound tightening [52], which we enhanced to support sign constraints.

**Symbolic Bound Tightening.** In symbolic bound tightening, we compute for each neuron $v$ a symbolic lower bound $sl(x)$ and a symbolic upper bound $su(x)$, which are linear combinations of the input neurons. Upper and lower bounds can then be derived from their symbolic counterparts using simple interval arithmetic. For example, suppose the network's input nodes are $x_1$ and $x_2$, and that for some neuron $v$ we have:

$$sl(v) = 5x_1 - 2x_2 + 3, \quad su(v) = 3x_1 + 4x_2 - 1$$

and that the currently known bounds are $x_1 \in [-1, 2]$, $x_2 \in [-1, 1]$ and $v \in [-2, 11]$. Using the symbolic bounds and the input bounds, we can derive that the upper bound of $v$ is at most $6 + 4 - 1 = 9$, and that its lower bound is at least $-5 - 2 + 3 = -4$. In this case, the upper bound we have discovered for $v$ is tighter than the previous one, and so we can update $v$'s range to be $[-2, 9]$.
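The arithmetic in this example can be sketched as follows. The representation of a symbolic bound as a coefficient list plus a constant, and the function name `concretize`, are our own assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def concretize(coeffs: Sequence[float], const: float,
               input_bounds: Sequence[Tuple[float, float]]) -> Tuple[float, float]:
    """Return the (min, max) of a linear expression over a box of input bounds."""
    lo = hi = const
    for c, (l, u) in zip(coeffs, input_bounds):
        if c >= 0:
            lo += c * l
            hi += c * u
        else:
            lo += c * u
            hi += c * l
    return lo, hi


input_bounds: List[Tuple[float, float]] = [(-1, 2), (-1, 1)]  # x1, x2
sl_v = ([5, -2], 3)    # sl(v) = 5*x1 - 2*x2 + 3
su_v = ([3, 4], -1)    # su(v) = 3*x1 + 4*x2 - 1

lower_from_sl, _ = concretize(sl_v[0], sl_v[1], input_bounds)  # -5 - 2 + 3 = -4
_, upper_from_su = concretize(su_v[0], su_v[1], input_bounds)  # 6 + 4 - 1 = 9

old_range = (-2, 11)
new_range = (max(old_range[0], lower_from_sl), min(old_range[1], upper_from_su))
print(new_range)  # (-2, 9)
```

Only the upper bound improves here, since the derived lower bound of $-4$ is weaker than the known $-2$.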

The symbolic bound expressions are propagated layer by layer [52]. Propagation through weighted sum layers is straightforward: the symbolic bounds are simply multiplied by the respective edge weights and summed up. Efficient approaches for propagations through ReLU layers have also been proposed [51]. Our contribution here is an extension of these techniques for propagating symbolic bounds also through sign layers. The approach again uses a trapezoid, although a coarser one, so that we can approximate each neuron from above and below using a single linear expression. More specifically, for $f = \text{sign}(b)$ with $b \in [l, u]$ and previously-computed symbolic bounds $su(b)$ and $sl(b)$, the symbolic bounds for $f$ are given by:

Fig. 8: Symbolic bounds for $f = \text{sign}(b)$.

$$sl(f) = \frac{2}{u} \cdot sl(b) - 1, \quad su(f) = -\frac{2}{l} \cdot su(b) + 1$$

An illustration appears in Fig. 8. The blue trapezoid is the relaxation we use for the symbolic bound computation, whereas the gray trapezoid is the one used for the LP relaxation discussed previously. The blue trapezoid is larger, and hence leads to looser bounds than the gray trapezoid; but it is computationally cheaper to compute and use, and our evaluation demonstrates its usefulness.
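A small sketch of this propagation step, under the assumption that $l < 0 < u$ and that a symbolic bound is stored as a coefficient list plus a constant (a hypothetical representation). The example reuses $b = 3x + 1$ with $b \in [-2, 4]$, for which $sl(b) = su(b) = 3x + 1$.

```python
from fractions import Fraction as F
from typing import List, Tuple

LinExpr = Tuple[List[F], F]  # (coefficients over the inputs, constant term)


def scale_and_shift(expr: LinExpr, factor: F, shift: F) -> LinExpr:
    coeffs, const = expr
    return [factor * c for c in coeffs], factor * const + shift


def sign_symbolic_bounds(sl_b: LinExpr, su_b: LinExpr, l: F, u: F) -> Tuple[LinExpr, LinExpr]:
    """sl(f) = (2/u) * sl(b) - 1 and su(f) = -(2/l) * su(b) + 1, for l < 0 < u."""
    if not (l < 0 < u):
        raise ValueError("phase already fixed; no relaxation needed")
    sl_f = scale_and_shift(sl_b, F(2) / u, F(-1))
    su_f = scale_and_shift(su_b, -F(2) / l, F(1))
    return sl_f, su_f


b_expr: LinExpr = ([F(3)], F(1))  # b = 3x + 1
sl_f, su_f = sign_symbolic_bounds(b_expr, b_expr, F(-2), F(4))
print(sl_f)  # coefficients [3/2], constant -1/2
print(su_f)  # coefficients [3], constant 2
```

Concretizing these expressions over $x \in [-1, 1]$ gives values outside $[-1, 1]$, which reflects the coarseness of this relaxation; in practice the result would be intersected with the already known bounds on $f$.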

**Polarity-based Splitting.** The Marabou framework supports a parallelized solving mode, using the Split-and-Conquer (S&C) algorithm [54]. At a high level, S&C partitions a verification query $\varphi$ into a set of sub-queries $\Phi := \{\varphi_1, \ldots, \varphi_n\}$, such that $\varphi$ and $\bigvee_{\varphi' \in \Phi} \varphi'$ are equi-satisfiable, and handles each sub-query independently. Each sub-query is solved with a timeout value; and if that value is reached, the sub-query is again split into additional sub-queries, and each is solved with a greater timeout value. The process repeats until one of the sub-queries is determined to be SAT, or until all sub-queries are proven UNSAT.

One Marabou strategy for creating sub-queries is by splitting the ranges of input neurons. For example, if in query $\varphi$ an input neuron $x$ is bounded in the range $x \in [0, 4]$ and $\varphi$ times out, it might be split into $\varphi_1$ and $\varphi_2$ such that $x \in [0, 2]$ in $\varphi_1$ and $x \in [2, 4]$ in $\varphi_2$. This strategy is effective when the neural network being verified has only a few input neurons.

Another way to create sub-queries is to perform case-splits on piecewise-linear constraints (sign constraints, in our case). For instance, given a verification query $\varphi := \varphi' \wedge f = \text{sign}(b)$, we can partition it into $\varphi^- := \varphi' \wedge b < 0 \wedge f = -1$ and $\varphi^+ := \varphi' \wedge b \geq 0 \wedge f = 1$. Note that $\varphi$ and $\varphi^+ \vee \varphi^-$ are equi-satisfiable.

The heuristics for picking which sign constraint to split on have a significant impact on the difficulty of the resulting sub-problems [54]. Specifically, it is desirable that the sub-queries be easier than the original query, and also that they be balanced in terms of runtime; i.e., we wish to avoid the case where $\varphi_1$ is very easy and $\varphi_2$ is very hard, as that makes poor use of parallel computing resources. To create easier sub-problems, we propose to split on sign constraints that occur in the earlier layers of the BNN, as that leads to efficient bound propagation when combined with our symbolic bound tightening mechanism. To create balanced sub-problems, we use a metric called polarity, which was proposed in [54] for ReLUs and is extended here to support sign constraints.

**Definition 1.** Given a sign constraint $f = \text{sign}(b)$, and the bounds $l \leq b \leq u$, where $l < 0$ and $u > 0$, the polarity of the sign constraint is defined as $p = \frac{u + l}{u - l}$.

Intuitively, the closer the polarity is to 0, the more balanced the resulting queries will be if we perform a case-split on this constraint. For example, if $\varphi = \varphi' \wedge -10 \leq b \leq 10$ and we create $\varphi_1 = \varphi' \wedge -10 \leq b < 0$, $\varphi_2 = \varphi' \wedge 0 \leq b \leq 10$, then queries $\varphi_1$ and $\varphi_2$ are roughly balanced. However, if initially $-10 \leq b \leq 1$, we obtain $\varphi_1 = \varphi' \wedge -10 \leq b < 0$ and $\varphi_2 = \varphi' \wedge 0 \leq b \leq 1$. In this case, $\varphi_2$ might prove significantly easier than $\varphi_1$ because the smaller range of $b$ in $\varphi_2$ could lead to very effective bound tightening. Consequently, we use a heuristic that picks the sign constraint with the smallest polarity among the first $k$ candidates (in topological order), where $k$ is a configurable parameter. In our experiments, we empirically selected $k = 5$.
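A possible reading of this heuristic is sketched below. It interprets "smallest polarity" as "polarity closest to 0", in line with the intuition given above; that interpretation, the function names, and the list-of-bounds input format are assumptions made for illustration only.

```python
from typing import List, Tuple


def polarity(l: float, u: float) -> float:
    """Polarity p = (u + l) / (u - l) of f = sign(b) with l <= b <= u, l < 0 < u."""
    if not (l < 0 < u):
        raise ValueError("polarity is only defined for l < 0 < u")
    return (u + l) / (u - l)


def pick_split_candidate(bounds: List[Tuple[float, float]], k: int = 5) -> int:
    """Index of the constraint, among the first k in topological order,
    whose polarity is closest to 0 (assumed meaning of 'smallest')."""
    first_k = list(enumerate(bounds))[:k]
    index, _ = min(first_k, key=lambda item: abs(polarity(*item[1])))
    return index


print(polarity(-10, 10))  # 0.0, a balanced split
print(polarity(-10, 1))   # about -0.82, an unbalanced split
print(pick_split_candidate([(-10, 1), (-10, 10), (-1, 3)]))  # 1
```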

#### **5 Implementation**

We implemented our approach as an extension to Marabou [34], which is an open-source, freely available SMT-based DNN verification framework [2]. Marabou implements the Reluplex algorithm, but with multiple extensions and optimizations, e.g., support for additional activation functions, deduction methods, and parallelization [54]. It has been used for a variety of verification tasks, such as network simplification [19] and optimization [47], verification of video streaming protocols [35], DNN modification [20], adversarial robustness evaluation [9,22,32], verification of recurrent networks [28], and others. However, to date Marabou could not support sign constraints, and thus could not be used to verify BNNs. Below we describe our main contributions to the code base. Our complete code is available as an artifact accompanying this paper [1], and has also been merged into the main Marabou repository [2].

**Basic Support for Sign Constraints (***SignConstraint.cpp***).** During execution, Marabou maintains a set of piecewise-linear constraints that are part of the query being solved. To support various activation functions, these constraints are represented using classes that inherit from the abstract PiecewiseLinearConstraint class. Here, we added a new sub-class, SignConstraint, that inherits from PiecewiseLinearConstraint. The methods of this class check whether the piecewise-linear sign constraint is satisfied and, if it is not, which possible changes to the current assignment could fix the violation. This class' methods also extend Marabou's deduction mechanism for bound tightening.

**Input Interfaces for Sign Constraints (***MarabouNetworkTF.py***).** Marabou supports various input interfaces, most notable of which is the TensorFlow interface, which automatically translates a DNN stored in TensorFlow protobuf or savedModel formats into a Marabou query. As part of our extensions, we enhanced this interface so that it can properly handle BNNs and sign constraints. Additionally, users can create queries using Marabou's native C++ interface, by instantiating the SignConstraint class discussed previously.

**Network-Level Reasoner (***NetworkLevelReasoner.cpp, Layer.cpp, LPFormulator.cpp***).** The Network-Level Reasoner (NLR) is the part of Marabou that is aware of the topology of the neural network being verified, as opposed to just the individual constraints that comprise it. We extended Marabou's NLR to support sign constraints and implement the optimizations discussed in Section 4. Specifically, one extension that we added allows this class to identify consecutive weighted sum layers and merge them. Another extension creates a linear over-approximation of the network, including the trapezoid-shaped over-approximation of each sign constraint. As part of the symbolic bound propagation process, the NLR traverses the network, layer by layer, each time computing the symbolic bound expressions for each neuron in the current layer.

**Polarity-Based Splitting (***DnCManager.cpp***).** We extended the methods of this class, which is part of Marabou's S&C mechanism, to compute the polarity value of each sign constraint (see Definition 1), based on the current bounds.

### **6 Evaluation**

All the benchmarks described in this section are included in our artifact, and are publicly available online [1].

**Strictly Binarized Networks.** We began by training a strictly binarized network over the MNIST digit recognition dataset.<sup>4</sup> This dataset includes 70,000 images of handwritten digits, each given as a 28 × 28 pixeled image, with normalized brightness values ranging from 0 to 1. The network that we trained has an input layer of size 784, followed by six binary blocks (four blocks of size 50, two blocks of size 10), and a final output layer with 10 neurons. Note that in the first block we omitted the sign layer in order to improve the network's accuracy.<sup>5</sup> The model was trained for 300 epochs using the Larq library [17] and the Adam optimizer [36], achieving 90% accuracy.

<sup>4</sup> http://yann.lecun.com/exdb/mnist/

After training, we used Larq's export mechanism to save the trained network in a TensorFlow format, and then used our newly added Marabou interface to load it. For our verification queries, we first chose 500 samples from the test set which were classified correctly by the network. Then, we used these samples to formulate adversarial robustness queries [33,48]: queries that ask Marabou to find a slightly perturbed input which is misclassified by the network, i.e. is assigned a different label than the original. We formulated 500 queries, constructed from 50 queries for each of ten possible perturbation values $\delta \in \{0.1, 0.15, 0.2, 0.3, 0.5, 1, 3, 5, 10, 15\}$ in $L_\infty$ norm, one query per input sample. An UNSAT answer from Marabou indicates that no adversarial perturbation exists (for the specified $\delta$), whereas a SAT answer includes, as the counterexample, an actual perturbation that leads to misclassification. Such adversarial robustness queries are the most widespread verification benchmarks in the literature (e.g., [16,25,33,52]). An example appears in Fig. 9: the image on the left is the original, correctly classified as 1, and the image on the right is the perturbed image discovered by Marabou, misclassified as 3.

Fig. 9: An adversarial example for the MNIST network.

Through our experiments we set out to evaluate our tool's performance, and also measured the contribution of each of the features that we introduced: (i) weighted sum (ws) layer elimination; (ii) LP relaxation; (iii) symbolic bound tightening (sbt); and (iv) polarity-based splitting. We thus defined five configurations of the tool: the all category, in which all four features are enabled, and four all-X configurations for X ∈ {ws, lp, sbt, polarity}, indicating that feature X is turned off and the other features are enabled. All five configurations utilized Marabou's parallelization features, except for all-polarity — where instead of polarity-based splitting we used Marabou's default splitting strategy, which splits the input domain in half in each step.

Fig. 10 depicts Marabou's results using each of the five configurations. Each experiment was run on an Intel Xeon E5-2637 v4 CPU machine, running Ubuntu 16.04 and using eight cores, with a wall-clock timeout of 5,000 seconds. Most notably, the results show the usefulness of polarity-based splitting when compared to Marabou's default splitting strategy: whereas the all-polarity configuration only solved 218 instances, the all configuration solved 458. It also shows that the weighted sum layer elimination feature significantly improves performance, from 436 solved instances in all-ws to 458 solved instances in all, and with significantly faster solving speed. With the remaining two features, namely LP relaxations and symbolic bound tightening, the results are less clear: although the all-lp and all-sbt configurations both slightly outperform the all configuration, indicating that these two features slowed down the solver, we observe that for many instances they do lead to an improvement; see Fig. 11. Specifically, on UNSAT instances, the all configuration was able to solve one more benchmark than either all-lp or all-sbt; and it strictly outperformed all-lp on 13% of the instances, and all-sbt on 21% of the instances. Gaining better insights into the causes for these differences is a work in progress.

<sup>5</sup> This is standard practice; see https://docs.larq.dev/larq/guides/bnn-architecture/

Fig. 10: Running the five configurations of Marabou on the MNIST BNN.

Fig. 11: Evaluating the LP relaxation and symbolic bound tightening features.

**XNOR-Net.** XNOR-Net [45] is a BNN architecture for image recognition networks. XNOR-Nets consist of a series of binary convolution blocks, each containing a sign layer, a convolution layer, and a max-pooling layer (here, we regard convolution layers as a specific case of weighted sum layers). We constructed such a network with two binary convolution blocks: the first block has three layers, including a convolution layer with three filters, and the second block has four layers, including a convolution layer with two filters. The two binary convolution blocks are followed by a batch normalization layer and a fully-connected weighted sum layer (10 neurons) for the network's output, as depicted in Fig. 12. Our network was trained on the Fashion-MNIST dataset, which includes 70,000 images from ten different clothing categories [55], each given as a 28 × 28 pixeled image. The model was trained for 30 epochs, and achieved a modest accuracy of 70.97%.

Fig. 12: The XNOR-Net architecture of our network (layer labels in the figure: Input, Convolution, Max-Pool, Sign, Batch Norm, Weighted Sum).

Fig. 13: An original image (left) and its perturbed, misclassified image (right).

For our verification queries, we chose 300 correctly classified samples from the test set, and used them to formulate adversarial robustness queries. Each query was formulated using one sample and a perturbation value δ ∈ {0.05, 0.1, 0.15, 0.2, 0.25, 0.3} in L<sub>∞</sub> norm. Fig. 13 depicts the adversarial image that Marabou produced for one of these queries. The image on the left is a correctly classified image of a shirt, and the image on the right is the perturbed image, now misclassified as a coat.
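To make the shape of such a query concrete, the following minimal Python sketch computes the per-pixel input bounds that an L<sub>∞</sub> perturbation of size δ induces around one sample. It is an illustration only: the function name `linf_box` and the clipping range [0, 1] for pixel values are assumptions, not details taken from the experiments, and the sketch does not use Marabou's API.

```python
from typing import List, Tuple


def linf_box(sample: List[float], delta: float,
             lo: float = 0.0, hi: float = 1.0) -> List[Tuple[float, float]]:
    """Return (lower, upper) bounds per pixel for an L-infinity ball of radius delta.

    The clipping range [lo, hi] is an assumption (normalized pixel values).
    """
    if delta < 0:
        raise ValueError("delta must be non-negative")
    return [(max(lo, p - delta), min(hi, p + delta)) for p in sample]


# Perturbation values listed in the text.
DELTAS = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3]

if __name__ == "__main__":
    pixels = [0.0, 0.5, 1.0]  # a toy three-pixel "image"
    for d in DELTAS:
        print(d, linf_box(pixels, d))
```

A robustness query then asks whether some input inside these bounds is assigned a label different from that of the original sample.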

Based on the results from the previous set of experiments, we used Marabou with weighted sum layer elimination and polarity-based splitting turned on, but with symbolic bound tightening and LP relaxation turned off. Each experiment ran on an Intel Xeon E5-2637 v4 machine, using eight cores and a wall-clock timeout of 7,200 seconds. The results are depicted in Table 1. The results demonstrate that UNSAT queries tended to be solved significantly faster than SAT ones, indicating that Marabou's search procedure for these cases needs further optimization. Overall, Marabou was able to solve 203 out of 300 queries. To the best of our knowledge, this is the first effort to formally verify an XNOR-Net. We note that these results demonstrate the usefulness of an SMT-based approach for BNN verification, as it allows the verification of DNNs with multiple types of activation functions, such as a combination of sign and max-pooling.

#### **7 Related Work**

DNNs have become pervasive in recent years, and the discovery of various faults and errors has given rise to multiple approaches for verifying them. These include various SMT-based approaches (e.g., [25, 33, 34, 38]), approaches based on LP and MILP solvers (e.g., [8, 14, 41, 49]), approaches based on symbolic interval propagation or abstract interpretation (e.g., [16, 50, 52, 53]), abstraction-refinement (e.g., [3, 15]), and many others. Most of these lines of work have focused on non-quantized DNNs. Verification of quantized DNNs is PSPACE-hard [24], and requires different tools than the ones used for their non-quantized counterparts [18]. Our technique extends an existing line of SMT-based verifiers to support also the sign activation functions needed for verifying BNNs; and these new activations can be combined with various other layers.

Table 1: Marabou's performance on the XNOR-Net queries.

Work to date on the verification of BNNs has relied exclusively on reducing the problem to Boolean satisfiability, and has thus been limited to the strictly binarized case [11,29,43,44]. Our approach, in contrast, can be applied to binarized neural networks that include activation functions beyond the sign function, as we have demonstrated by verifying an XNOR-Net. Comparing the performance of Marabou and the SAT-based approaches is left for future work.

#### **8 Conclusion**

BNNs are a promising avenue for leveraging deep learning in devices with limited resources. However, it is highly desirable to verify their correctness prior to deployment. Here, we propose an SMT-based verification approach that enables the verification of BNNs. This approach, which we have implemented as part of the Marabou framework [2], seamlessly integrates with the other components of the SMT solver in a modular way. Using Marabou, we have verified, for the first time, a network that uses both binarized and non-binarized layers. In the future, we plan to improve the scalability of our approach, by enhancing it with stronger bound deduction capabilities, based on abstract interpretation [16].

**Acknowledgements.** We thank Nina Narodytska, Kyle Julian, Kai Jia, Leon Overweel and the Plumerai research team for their contributions to this project. The project was partially supported by the Israel Science Foundation (grant number 683/18), the Binational Science Foundation (grant number 2017662), the National Science Foundation (grant number 1814369), and the Center for Interdisciplinary Data Science Research at The Hebrew University of Jerusalem.

#### **References**



### cake\_lpr: Verified Propagation Redundancy Checking in CakeML

Yong Kiam Tan<sup>1(✉)</sup>, Marijn J. H. Heule<sup>1</sup>, and Magnus O. Myreen<sup>2</sup>

<sup>1</sup> Computer Science Department, Carnegie Mellon University, Pittsburgh, USA {yongkiat,mheule}@cs.cmu.edu

<sup>2</sup> Chalmers University of Technology, Gothenburg, Sweden myreen@chalmers.se

Abstract. Modern SAT solvers can emit independently checkable proof certificates to validate their results. The state-of-the-art proof system that allows for compact proof certificates is *propagation redundancy* (PR). However, the only existing method to validate proofs in this system with a formally verified tool requires a transformation to a weaker proof system, which can result in a significant blowup in the size of the proof and increased proof validation time. This paper describes the first approach to formally verify PR proofs on a succinct representation; we present (i) a new *Linear PR* (LPR) proof format, (ii) a tool to efficiently convert PR proofs into LPR format, and (iii) cake\_lpr, a verified LPR proof checker developed in CakeML. The LPR format is backwards compatible with the existing LRAT format, but extends the latter with support for the addition of PR clauses. Moreover, cake\_lpr is verified using CakeML's binary code extraction toolchain, which yields correctness guarantees for its machine code (binary) implementation. This further distinguishes our clausal proof checker from existing ones because unverified extraction and compilation tools are removed from its trusted computing base. We experimentally show that LPR provides efficiency gains over existing proof formats and that the strong correctness guarantees are obtained without significant sacrifice in the performance of the verified executable.

Keywords: linear propagation redundancy · binary code extraction

#### 1 Introduction

Given a formula of propositional logic, the task of a SAT solver is to decide if there exists an assignment that satisfies the formula. Such a *satisfying assignment*, if found by a SAT solver, is easily verifiable by independent checkers and so one does not need to trust the inner workings of the solver. The situation with *unsatisfiable* formulas, i.e., where no satisfying assignment exists, is not as straightforward. Here, SAT solvers must produce an *unsatisfiability proof*. Ideally, the proof system (and proof format) for such proofs should be sufficiently expressive, allowing SAT solvers to efficiently produce proofs that correspond to the SAT solving techniques they use at runtime. At the same time, the resulting proofs ought to be efficiently checkable by independent and trustworthy tools.

<sup>©</sup> The Author(s) 2021 J. F. Groote and K. G. Larsen (Eds.): TACAS 2021, LNCS 12652, pp. 223–241, 2021. https://doi.org/10.1007/978-3-030-72013-1\_12

The de facto standard proof system for propositional unsatisfiability proofs is known as Resolution Asymmetric Tautology (RAT) [24]. The associated DRAT format [36] combines clause addition based on RAT steps and clause deletion. Independent checking tools can validate proofs in the DRAT format; they have been used to check the results of the SAT competitions since 2014 [36] and in industry [15]. Enriching DRAT proofs with hints is the main technique for developing efficient verified proof checkers, e.g., existing verified checkers use the enriched proof formats LRAT [6] and GRAT [28].

A recently proposed proof system, called Propagation Redundancy (PR) [21], generalizes RAT. There exist short PR proofs without new variables for many problems that are hard for resolution, such as pigeonhole formulas, Tseitin problems, and mutilated chessboard problems [19]. Due to the absence of new variables it is easier to find PR proofs automatically [20], and it is considered unlikely that there exist short RAT proofs for these problems that do not introduce new variables nor reuse eliminated variables [21]. Such PR proofs can be checked directly [21], or they can first be transformed into DRAT proofs or even Extended Resolution proofs by introducing new variables [18,25]. In theory, the blowup is small, i.e., polynomial-sized. However, in practice, the transformed proofs can be significantly more expensive to validate compared to the original PR proofs [21].

A natural question arises: why should proof checkers be trusted to correctly check proofs if we do not likewise trust SAT solvers to correctly determine satisfiability? One answer is that proof checkers are much easier to implement so their code can be carefully audited. Another answer is that the algorithms underlying proof checkers have been *formally verified* in a proof assistant [6, 15, 28]. However, to get executable code for these verified checkers, some additional unverified steps are still required. Although unlikely, each of these steps can introduce bugs in the resulting executable: (1) the algorithms are extracted by unverified code generation tools into source code for a programming language; (2) unverified parsing, file I/O, and command-line interface code is added; (3) the combined code is then compiled by unverified compilers down to executable machine code.

The contributions of this paper are: (i) a new Linear PR (henceforth LPR) proof format that enriches PR proofs with hints and is backwards compatible with the LRAT format; (ii) a tool to efficiently enrich PR proofs with hints; and (iii) cake\_lpr, an efficient verified LPR proof checker with correctness guarantees, including for steps (1)–(3) enumerated above. The cake\_lpr tool is publicly available at https://github.com/tanyongkiam/cake\_lpr and it was used to validate the unsatisfiability proofs in the 2020 SAT Competition because of its strong trust story combined with easy compilation and usage. Moreover, the stronger proof system could be supported in future competitions.

Section 3 shows how PR proofs can be enriched to obtain LPR proofs and presents the corresponding LPR proof checking algorithm (Contributions i & ii). Notably, existing LRAT proof checkers can be extended in a clean and minimal way to support LPR proofs. Section 4 explains the implementation of our checker in CakeML, as well as the correctness guarantees and high-level verification strategy behind the proofs (Contribution iii). Section 5 benchmarks our proof format and proof checker against existing implementations. A summary comparison of the new proof checker against existing verified proof checkers is in Table 1.

Table 1. A comparison of SAT proof checkers that have been verified in various proof assistants [6,15,28]. Green background (cells with +) indicates desirable properties, e.g., LPR is based on a stronger proof system than LRAT and GRAT, while red backgrounds (cells with ×) indicate less desirable properties. Yellow backgrounds (cells with −) are also undesirable but to a lesser extent.

#### 2 Background

This section provides background on CakeML and its related tools. It also recalls the standard problem format and clausal proof systems used by SAT solvers.

#### 2.1 HOL4 and CakeML

HOL4 is a proof assistant implementing classical higher-order logic [34]. CakeML is a programming language *deeply embedded* in HOL4, i.e., its abstract syntax is represented as a HOL datatype and its semantics is formalized within HOL4. Several tools for developing verified CakeML software are used in this work to fill the verification gaps in the correspondingly enumerated items in Section 1:

- the CakeML proof-producing translator [32] automatically synthesizes verified source code from pure algorithmic specifications;
- the CakeML characteristic formula (CF) framework [14] provides a separation logic which can be used to manually verify (more efficient) imperative code for performance-critical parts of the proof checker.

The combination of these tools enables *binary code extraction* [27] where verified machine code is extracted directly in HOL4. Several other CakeML-based programs have been verified using these tools, including: certificate checkers for floating-point error bounds [3] and vote counting [13], and an OpenTheory article checker [1]. Œuf provides a similar toolchain in the Coq proof assistant [31].

#### 2.2 SAT Problems and Clausal Proofs

Fix a set of boolean *variables* $x_1, \dots, x_n$, where the negation of variable $x_i$ is denoted $\overline{x}_i$, and the negation of $\overline{x}_i$ is identified with $x_i$. Variables and their negations are called *literals* and are denoted using $l$. The input for propositional SAT solvers is a formula $F$ in *conjunctive normal form* (CNF) over the set of variables $x_1, \dots, x_n$. Here, CNF means that $F$ consists of an outer logical conjunction $F \equiv \bigwedge_{i=1}^{m} C_i$, where each *clause* $C_i$ is a disjunction over some of the literals $C_i \equiv l_{i1} \lor l_{i2} \lor \dots \lor l_{ik}$. Formulas in CNF can be represented directly as sets of clauses and clauses as sets of literals. The empty clause is denoted $\bot$. An *assignment* $\alpha$ assigns boolean values to each variable; $\alpha$ can be *partial*, i.e., it only assigns values to some of the variables. Like formulas and clauses, a (partial) assignment can be represented as the set of literals assigned the boolean value true by that assignment. The negation of an assignment, denoted $\overline{\alpha}$, assigns the negation of all literals in $\alpha$. An assignment $\alpha$ *satisfies* a clause $C$ iff their set intersection is nonempty. Additionally, we define $C|_\alpha = \top$ if $\alpha$ satisfies $C$; otherwise, $C|_\alpha$ denotes the result of removing from $C$ all the literals falsified by $\alpha$, i.e., $C|_\alpha = C \setminus \overline{\alpha}$. For a formula $F$, we define $F|_\alpha = \{ C|_\alpha \mid C \in F \text{ and } C|_\alpha \neq \top \}$. Intuitively, $F|_\alpha$ contains the remaining clauses in formula $F$ after committing to the partial assignment $\alpha$.
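The set-based definitions above translate almost directly into code. The following Python sketch is an illustration under stated assumptions: literals are DIMACS-style nonzero integers (`-3` is the negation of `3`), `None` stands for ⊤, and the names `restrict_clause` and `restrict_formula` are hypothetical helpers, not part of any tool discussed here.

```python
from typing import FrozenSet, Iterable, Optional, Set

Clause = FrozenSet[int]      # a clause as a set of integer literals
Assignment = FrozenSet[int]  # the set of literals assigned true


def restrict_clause(clause: Clause, alpha: Assignment) -> Optional[Clause]:
    """Return C|alpha: None (i.e. top) if alpha satisfies C, else C minus falsified literals."""
    if clause & alpha:  # nonempty intersection: alpha satisfies the clause
        return None
    # A literal l is falsified by alpha exactly when its negation -l is in alpha.
    return frozenset(l for l in clause if -l not in alpha)


def restrict_formula(formula: Iterable[Clause], alpha: Assignment) -> Set[Clause]:
    """Return F|alpha: the restricted clauses of F that are not satisfied by alpha."""
    result: Set[Clause] = set()
    for clause in formula:
        restricted = restrict_clause(clause, alpha)
        if restricted is not None:
            result.add(restricted)
    return result


if __name__ == "__main__":
    F = {frozenset({1, 2}), frozenset({-1, 3}), frozenset({-3})}
    print(restrict_formula(F, frozenset({1})))
```

With this encoding, a formula is satisfied by an assignment exactly when `restrict_formula` returns the empty set, and an empty `frozenset()` inside the result plays the role of the empty clause.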

The task of a SAT solver is to determine whether $F$ is *satisfiable*, i.e., whether there exists a (possibly partial) assignment $\alpha$ such that $F|_\alpha$ is empty. Any satisfying assignment can be used as a certificate of satisfiability. Formulas without a satisfying assignment are *unsatisfiable*. Certifying unsatisfiability is more difficult and typically uses a *clausal* proof system [21]. The idea behind these proof systems is briefly recalled next, using the key concept of clause redundancy.

Definition 1. *A clause* C *is redundant with respect to formula* F *iff* F ∧ C *and* F *are both satisfiable or both unsatisfiable, i.e., they are satisfiability equivalent.*

A clause C that is redundant for F can be added to F without changing its satisfiability. Clausal proof systems work by successively adding redundant clauses to F until the empty clause ⊥ is added, as illustrated below:

$$F \stackrel{+\text{ redundant}}{\Longrightarrow} F \land C_1 \stackrel{+\text{ redundant}}{\Longrightarrow} F \land C_1 \land C_2 \stackrel{+\text{ redundant}}{\Longrightarrow} \cdots \Longrightarrow F \land C_1 \land C_2 \land \cdots \land \bot$$

Satisfiability is preserved along each $\Longrightarrow$ step because of redundancy, e.g., satisfiability of $F$ implies satisfiability of $F \land C_1$. Since the final formula is unsatisfiable, the sequence of redundant clause addition steps $C_1, C_2, \dots, \bot$ corresponds to a proof of unsatisfiability for $F$. Deciding clause redundancy is as hard as solving the SAT problem itself because $\bot$ is always redundant for unsatisfiable formulas. The difference between clausal proof systems is how the redundancy of a (proposed) redundant clause $C$ is efficiently certified at each proof step.

Many notions of redundancy are based on unit propagation. A *unit clause* is a clause with only one literal. The result of applying the *unit clause rule* to a formula $F$ is the formula $F|_l$ where $(l)$ is a unit clause in $F$. The iterated application of the unit clause rule to a formula $F$ until no unit clauses are left is called *unit propagation*. If unit propagation on $F$ yields the empty clause $\bot$, denoted by $F \vdash_1 \bot$, we say that $F$ implies $\bot$ by unit propagation. The notion of *implied by unit propagation* is also used for regular clauses as follows: $F \vdash_1 C$ iff $F \land \neg C \vdash_1 \bot$ with $\neg C = \bigwedge_{l \in C} (\overline{l})$. Observe that $\neg C$ can be viewed as a partial assignment that assigns the literals $\overline{l}$, for $l \in C$, to true. For a formula $G$, $F \vdash_1 G$ iff $F \vdash_1 C$ for all $C \in G$. The main clausal proof system used in this paper is based on propagation redundant clauses, which are defined as follows.
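As a rough illustration of these notions, the Python sketch below implements unit propagation naively by repeated scans over the clause list and uses it to test whether a clause is implied by unit propagation. It assumes DIMACS-style integer literals and non-tautological clauses without repeated literals; the names `unit_propagate` and `implied_by_up` are hypothetical, and real checkers use far more efficient data structures.

```python
from typing import Iterable, List, Set, Tuple


def unit_propagate(formula: List[List[int]],
                   assumptions: Iterable[int] = ()) -> Tuple[bool, Set[int]]:
    """Naive unit propagation.

    Returns (conflict, alpha), where conflict is True iff the empty clause is
    derived and alpha is the set of literals assigned true along the way.
    """
    alpha: Set[int] = set(assumptions)
    changed = True
    while changed:
        changed = False
        for clause in formula:
            if any(l in alpha for l in clause):
                continue  # clause already satisfied
            remaining = [l for l in clause if -l not in alpha]
            if not remaining:
                return True, alpha  # every literal falsified: empty clause
            if len(remaining) == 1:
                alpha.add(remaining[0])  # unit clause rule
                changed = True
    return False, alpha


def implied_by_up(formula: List[List[int]], clause: List[int]) -> bool:
    """Check F |-1 C by propagating the negation of C as assumptions."""
    conflict, _ = unit_propagate(formula, [-l for l in clause])
    return conflict


if __name__ == "__main__":
    F = [[1, 2], [-1, 2]]
    print(implied_by_up(F, [2]))  # assuming not-2 forces 1, which falsifies [-1, 2]
    print(implied_by_up(F, [1]))
```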

Definition 2. *Let* F *be a formula,* C *a nonempty clause, and* α *the smallest assignment that falsifies* C*. Then,* C *is* propagation redundant (PR) *with respect to* F *if there exists an assignment* ω *which satisfies* C *and such that* $F|_\alpha \vdash_1 F|_\omega$*.*

Intuitively, a PR clause C is redundant because any satisfying assignment for F that does not already satisfy C can be modified to a satisfying assignment for F ∧ C by updating its literals assigned to true according to the (partial) witnessing assignment ω [21]. Propagation redundancy is efficiently checkable in polynomial time using the witnessing assignment and PR generalizes various other notions of clause redundancy, including the de facto standard Resolution Asymmetric Tautology (RAT) proof system (see [21, Theorem 2]) that is able to compactly express all current techniques used in state-of-the-art SAT solvers [24].

In practice, clausal proof formats also contain deletion information to speed up proof validation. Hence, unsatisfiability proofs for formula $F$ are modeled as sequences $I_1, \dots, I_n$ of *instructions* that either add or delete a clause. An *addition instruction* is a triple $\langle \mathfrak{a}, C, \omega \rangle$, where $C$ is a clause and $\omega$ is a (possibly empty) *witnessing assignment*; a *deletion instruction* is a pair $\langle \mathfrak{d}, C \rangle$ where $C$ is a clause. The sequence $I_1, \dots, I_n$ gives rise to formulas $F_1, \dots, F_n$ with $F_0 = F$ as follows, where $F_j$ is the *accumulated formula* up to the $j$-th instruction:

$$F_j = \begin{cases} F_{j-1} \cup \{C\} & \text{if } I_j \text{ is of the form } \langle \mathfrak{a}, C, \omega \rangle \\ F_{j-1} \setminus \{C\} & \text{if } I_j \text{ is of the form } \langle \mathfrak{d}, C \rangle \end{cases}$$

A PR proof of unsatisfiability is *valid* if the last instruction adds the empty clause $I_n = \langle \mathfrak{a}, \bot, \emptyset \rangle$, and, for all addition instructions $I_j = \langle \mathfrak{a}, C_j, \omega_j \rangle$, it holds that $C_j$ is PR with respect to $F_{j-1}$ using witness $\omega_j$. In case an empty witness is provided for $I_j$, then $F_{j-1} \vdash_1 C$ should hold.
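The accumulated formulas can be computed by a simple fold over the instruction sequence. The Python sketch below illustrates only this bookkeeping, not the redundancy checks; the tuple encoding `("a", clause, witness)` and `("d", clause)` and the name `accumulate` are assumptions made for the example.

```python
from typing import FrozenSet, Iterable, List, Tuple

Clause = FrozenSet[int]


def accumulate(formula: Iterable[Clause],
               instructions: Iterable[Tuple]) -> List[FrozenSet[Clause]]:
    """Return [F_0, F_1, ..., F_n] for a sequence of add/delete instructions."""
    current = set(formula)
    history = [frozenset(current)]  # F_0 is the input formula
    for instruction in instructions:
        kind, clause = instruction[0], instruction[1]
        if kind == "a":    # addition: ("a", clause, witness)
            current.add(clause)
        elif kind == "d":  # deletion: ("d", clause)
            current.discard(clause)
        else:
            raise ValueError(f"unknown instruction kind: {kind!r}")
        history.append(frozenset(current))
    return history


if __name__ == "__main__":
    F0 = [frozenset({1, 2}), frozenset({-1})]
    steps = [("a", frozenset({2}), frozenset()),
             ("d", frozenset({1, 2})),
             ("a", frozenset(), frozenset())]
    for j, Fj in enumerate(accumulate(F0, steps)):
        print(j, sorted(map(sorted, Fj)))
```

A full validity check would additionally test, before each addition, that the added clause is PR with respect to the preceding accumulated formula.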

#### 3 Linear Propagation Redundancy

This section describes a new clausal proof format called LPR (short for Linear Propagation Redundancy). The format is designed to allow efficient validation of PR clauses using a (verified) proof checker. We also enhanced the DPR-trim tool<sup>3</sup> to efficiently add hints to PR proofs, thereby turning them into LPR proofs. Throughout the section, we emphasize how LPR can be viewed as a clean and minimal extension of the existing LRAT proof format, which thereby enables its straightforward implementation in existing LRAT tools.

```
proof = {line}
line = (lpr|delete), "\n"
lpr = id, clause, witness, "0", idlist, {reduced}, "0"
delete = id, "d", idlist, "0"
reduced = neg, idlist
idlist = {id}
id = pos
lit = pos|neg
pos = "1" | "2" | ...
neg = "-", pos
clause = {lit}
witness = {lit}
```
Fig. 1. The grammar for the LPR format. Additions compared to the LRAT grammar [6], i.e., the witness entries, are highlighted in bold in the original figure.

The most commonly used proof format for SAT solvers is DRAT, which combines deletion with RAT redundancy [36]. DRAT proofs are easy for SAT solvers to emit, and top-tier SAT solvers support the format, but they have some disadvantages for verified proof checking. In particular, checking whether a clause is RAT requires a significant amount of proof search to find the unit clauses necessary for showing the implied-by-unit-propagation property. This complicates verification of the proof checking algorithm and slows down the resulting verified proof checkers. The idea behind the Linear RAT (LRAT) [6, 15] and GRAT [28] formats is to include these unit clauses as hints so that verified proof checkers can follow the hints directly without the need for proof search. The LPR format lifts this idea to allow fast validation of the PR property.

An assignment $\omega$ *reduces* a clause $C$ if $C|_\omega \subset C$ and $C|_\omega \neq \top$. To check the PR property $F|_\alpha \vdash_1 F|_\omega$, it suffices to check, for each clause $C \in F$ reduced by $\omega$, that $F|_\alpha \vdash_1 C|_\omega$. Hence, in practice, a smaller $\omega$ yields a cheaper PR check. The LPR format extends the PR format by adding, for each clause that is reduced by the witness, a list of all unit clause hints required for showing the implied-by-unit-propagation property. Additionally, in order to point to clauses, the LPR format includes an index for each clause at the beginning of each line. The grammar of the LPR format is shown in Fig. 1.
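The reduced-clause criterion can be illustrated with a naive, hint-free check. The Python sketch below is not how DPR-trim or cake\_lpr work internally: it performs the proof search that LPR hints are designed to avoid, and the names `unit_propagate` and `is_pr` are hypothetical. It assumes integer literals and clauses without repeated literals, and it treats a reduced clause that is already satisfied by the propagated assignment as trivially implied.

```python
from typing import List, Set


def unit_propagate(formula: List[List[int]], alpha: Set[int]) -> bool:
    """Extend alpha in place by unit propagation; return True on conflict."""
    changed = True
    while changed:
        changed = False
        for clause in formula:
            if any(l in alpha for l in clause):
                continue
            remaining = [l for l in clause if -l not in alpha]
            if not remaining:
                return True
            if len(remaining) == 1:
                alpha.add(remaining[0])
                changed = True
    return False


def is_pr(formula: List[List[int]], clause: List[int], witness: List[int]) -> bool:
    """Naively check that a nonempty clause is PR w.r.t. formula using witness."""
    omega = set(witness)
    if not any(l in omega for l in clause):
        return False                      # the witness must satisfy the clause
    alpha = {-l for l in clause}          # smallest assignment falsifying the clause
    if unit_propagate(formula, alpha):
        return True                       # clause is implied by unit propagation
    for d in formula:
        if any(l in omega for l in d):
            continue                      # satisfied by omega
        if not any(-l in omega for l in d):
            continue                      # not reduced by omega
        d_omega = [l for l in d if -l not in omega]
        if any(l in alpha for l in d_omega):
            continue                      # already satisfied under alpha
        trial = alpha | {-l for l in d_omega}
        if not unit_propagate(formula, trial):
            return False                  # the reduced clause is not implied
    return True


if __name__ == "__main__":
    # Variables: 1 = x, 2 = y, 3 = z; F = (x or y), (not x or y), (not x or z).
    F = [[1, 2], [-1, 2], [-1, 3]]
    print(is_pr(F, [1], [1, 3]), is_pr(F, [1], [1]))
```

In the small example, the unit clause (x) is PR with witness {x, z} but not with the single-literal witness {x}, which is the kind of step that separates PR from RAT.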

Our extension to DPR-trim enriches input PR proofs by finding and adding all required unit clause hints. It also shrinks the witness $\omega$ where possible: every literal in $\omega \cap \alpha$ is removed as well as any literal in $\omega$ that is implied by unit propagation from $F|_\alpha$. The shrinking was shown to be correct [21], but has not been implemented so far. We observed that the witnesses in the PR proofs produced by SaDiCaL [20] can be substantially compressed using this method.

<sup>3</sup> LPR hint addition is now part of the public GitHub version available at https://github.com/marijnheule/dpr-trim using the command-line option -L.

Fig. 2. (Left) The first ten clauses of pigeonhole formula (4 pigeons, 3 holes) in the DIMACS format used by SAT solvers. (Right) The LPR refutation consisting of clause-witness pairs and unit clause hints. The first bold integer in each line is the clause index while other bold integers are the unit clause hints. Dropping the bold integers yields a proof in the PR format. Redundant spaces have been added to improve readability.

Fig. 2 (left) shows an example formula in the standard DIMACS problem format. The DIMACS format includes a header line starting with "p cnf " followed by the number of variables and the number of clauses. The non-comment lines (not starting with "c ") represent clauses, and they end with "0". Positive integers denote positive literals, while negative integers denote negative literals. Fig. 2 (right) shows a corresponding proof in LPR format. Deletion lines in LPR are formatted identically to LRAT [6] (not shown here). For clause addition lines, the LPR format only differs from LRAT in case the clause to be added has PR but not RAT redundancy. A clause addition line in LPR format consists of three parts. The first part is the first integer on the line, which denotes the index of the new clause. The second part consists of the clause and the witness; the first group of literals is the clause. The (potentially empty) witness starts from the second occurrence of the first literal of the clause and extends until the first 0, which separates it from the unit clause hints. The second part exactly matches the PR proof format [21]. The third part (after the first 0) consists of the unit clause hints, which exactly match the LRAT format [6].
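The three-part layout of an addition line can be sketched as a small parser. The Python code below is an illustration based only on the description above and the grammar in Fig. 1; the function name `parse_lpr_addition`, the returned tuple shape, and the sample line used in the usage example are hypothetical (the sample is made up and is not taken from Fig. 2). Deletion lines are rejected rather than parsed.

```python
from typing import Dict, List, Tuple


def parse_lpr_addition(line: str) -> Tuple[int, List[int], List[int], List[int], Dict[int, List[int]]]:
    """Split an LPR clause addition line into
    (clause index, clause, witness, unit hints, hints per reduced clause)."""
    parts = line.split()
    if len(parts) > 1 and parts[1] == "d":
        raise ValueError("deletion lines are not handled by this sketch")
    tokens = [int(p) for p in parts]
    if len(tokens) < 3 or tokens[0] <= 0 or tokens[-1] != 0:
        raise ValueError("malformed LPR addition line")
    clause_id = tokens[0]
    sep = tokens.index(0, 1)            # first 0: end of clause and witness
    if sep == len(tokens) - 1:
        raise ValueError("missing hint section")
    body = tokens[1:sep]
    clause, witness = body, []
    if body and body[0] in body[1:]:
        # The witness starts at the second occurrence of the clause's first literal.
        start = body.index(body[0], 1)
        clause, witness = body[:start], body[start:]
    unit_hints: List[int] = []
    reduced_hints: Dict[int, List[int]] = {}
    current = unit_hints
    for h in tokens[sep + 1:-1]:
        if h < 0:                       # a negative id opens the hints for a reduced clause
            current = reduced_hints.setdefault(-h, [])
        else:
            current.append(h)
    return clause_id, clause, witness, unit_hints, reduced_hints


if __name__ == "__main__":
    print(parse_lpr_addition("11 -1 2 -1 3 0 4 7 -5 2 9 -8 3 0"))
```

For the made-up line above, the clause is `[-1, 2]`, the witness is `[-1, 3]`, the unit clause hints are `[4, 7]`, and clauses 5 and 8 carry the hint lists `[2, 9]` and `[3]`.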

The checking algorithm for LPR, shown in Fig. 3, overlaps significantly with that for LRAT (see [6, Algorithm 1]). The only differences are Steps 4 and 5.1. In Step 4, the witness is used (if present) instead of always using the first literal in $C_j$. In Step 5.1, clauses are skipped if they are satisfied by the witness. Notice that a clause can only be both reduced and satisfied by a witness if the witness consists of at least two literals, while in the LRAT format witnesses always consist of exactly one literal. Note also that the algorithm does not check whether $C_j|_\omega = \top$, which is a requirement for PR. This omission is allowed because the first literal in $\omega$ in the LPR (and PR) format is the same as the first literal in $C_j$.

Input: CNF $F = \{C_i\}_{i \in I}$ and line $\ell$, an LPR step.

Output: YES if parsed clause $C_j$ proved PR for $F$ by $\ell$, NO otherwise.

1. parse $\ell$ as $j, C_j, \omega_j, 0, \vec{i}_0, \{-i'_k, \vec{i}_k\}_{k=1}^{n}$, instantiating variables with (vectors of) positive integers.
2. set $\alpha \leftarrow \neg C_j$
3. for $i \in \vec{i}_0$
   - 3.1. set $C'_i \leftarrow C_i|_\alpha$
   - 3.2. if $C'_i = \bot$, return YES
   - 3.3. if $C'_i = \top$ or $|C'_i| \geq 2$, return NO
   - 3.4. set $\alpha \leftarrow \alpha \cup C'_i$
4. **if $\omega_j \neq \emptyset$ then set $\omega \leftarrow \omega_j$ else** set $\omega \leftarrow (C_j)_1$ (if $C_j = \bot$, return NO)
5. for $i \in I$
   - 5.1. if $C_i$ **is satisfied by $\omega$ or** is not reduced by $\omega$, skip to next iteration of Step 5.
   - 5.2. find $k$ such that $i'_k = i$ (from $\ell$) (return NO if no such $k$ exists)
   - 5.3. if $C_i|_{(\alpha \setminus \overline{\omega})} = \top$, skip
   - 5.4. set $\alpha' \leftarrow \alpha \cup (\neg C_i \setminus \omega)$
   - 5.5. for $m \in \vec{i}_k$
     - 5.5.1. set $C'_m \leftarrow C_m|_{\alpha'}$
     - 5.5.2. if $C'_m = \bot$, skip to next iteration of Step 5.
     - 5.5.3. if $C'_m = \top$ or $|C'_m| \geq 2$, return NO
     - 5.5.4. set $\alpha' \leftarrow \alpha' \cup C'_m$
   - 5.6. return NO
6. return YES

Fig. 3. Algorithm to check a single clause addition step in the LPR format. The bold parts show the additions compared to LRAT proof checking [6].

#### 4 CakeML Proof Checking

This section explains the implementation and verification of cake\_lpr, our verified CakeML LPR proof checker. Section 4.1 focuses on the high-level verification strategy which we used to reduce the verification task to mostly routine low-level proofs (the latter details are omitted). Section 4.2 highlights important verified performance optimizations used in the proof checker.

#### 4.1 Verification Strategy

The development of cake\_lpr proceeds in three refinement steps, where each step progressively produces a more concrete and performant implementation of the proof checker. These refinements are visualized in the three columns of Fig. 4.

Step 1 formalizes the definition of CNF formulas and their unsatisfiability, as well as the PR proof system described in Section 2.2. The inputs and outputs to the proof system are abstract and not tied to any concrete representation at this step. For example, input variables are drawn from an arbitrary type α, and clauses and CNFs are represented using sets. The correctness of the PR proof system is proved in this step, i.e., we show that a valid PR proof implies unsatisfiability of the input CNF. The proof essentially follows [21, Theorem 1].

Fig. 4. The three-step refinement used in the development of cake\_lpr.

Step 2 implements a purely functional version of the LPR proof checking algorithm from Fig. 3. Here, the inputs and outputs are given concrete representations with computable datatypes, e.g., literals are integers (similar to DIMACS), clauses are lists of integers, and CNFs are lists of clauses. These concrete representations lift naturally to the abstract, set-based representation from Step 1. The output is a YES or NO answer according to the algorithm from Fig. 3. The correctness theorem for Step 2 shows that LPR proof checking correctly refines the PR proof system, i.e., if it outputs YES, then there exists a valid PR proof for the input (lifted) CNF; by Step 1, this implies that the CNF is unsatisfiable.<sup>4</sup>

Step 3 uses imperative features available in the CakeML source language, e.g., (byte) arrays and exceptions, to improve code performance; these optimizations are detailed further in Section 4.2. This step also adds user interface features like parsing and file I/O so that the input CNF formula is read (and parsed) from a file, and the results are printed on the standard output and error streams. The verification of this step uses CakeML's proof-producing translator [32] and characteristic formula framework [14] to prove the correctness of the source code implementation of cake\_lpr; this code is subsequently compiled with the verified CakeML compiler. Composing the correctness theorem for source cake\_lpr with CakeML's compiler correctness theorem yields the corresponding correctness theorem for the cake\_lpr binary. The final correctness theorem is given in Appendix A. Briefly, it shows that if the cake\_lpr executable prints the string "s VERIFIED UNSAT\n" to the standard output stream (in CakeML's FFI model [10]), then the input (parsed) DIMACS file is an unsatisfiable CNF.

<sup>4</sup> If the output is NO, the input CNF could still be unsatisfiable, but the input LPR proof is not valid according to the algorithm in Fig. 3.

#### 4.2 Verified Optimizations

To minimize verification effort, CakeML's imperative features are only used for the most performance-critical steps of cake\_lpr. Our design decisions are based on empirical observations about the LPR proof checking algorithm. These are explained below with reference to specific steps in the algorithm from Fig. 3.

Array-based representations. In practice, many LPR proof steps do not require the full strength of a PR (or RAT) clause. Hence, a large part of proof checking time is spent in the Step 3 loop of the algorithm and it is important to compute the main loop bottleneck, $C_i|_\alpha$ in Step 3.1, as efficiently as possible. CakeML's native byte arrays are used to maintain a compact bitset-like representation of the assignment $\alpha$, so that $C_i|_\alpha$ can be computed in one pass over $C_i$ with constant time bitset lookup for each literal in $C_i$.
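The idea of a byte-array-backed assignment can be sketched in Python with `bytearray`. This is only an illustration of the general technique: the class name `ByteAssignment`, the index layout (two slots per variable), and the `restrict` helper are assumptions and do not describe the actual memory layout used by cake\_lpr.

```python
from typing import List, Optional


class ByteAssignment:
    """Assignment stored as one byte per literal for constant-time lookups."""

    def __init__(self, num_vars: int) -> None:
        # Slot 2*v holds literal v, slot 2*v + 1 holds literal -v (slot 0 and 1 unused).
        self.bits = bytearray(2 * (num_vars + 1))

    @staticmethod
    def _index(lit: int) -> int:
        return 2 * abs(lit) + (1 if lit < 0 else 0)

    def set_true(self, lit: int) -> None:
        self.bits[self._index(lit)] = 1

    def is_true(self, lit: int) -> bool:
        return self.bits[self._index(lit)] == 1

    def is_false(self, lit: int) -> bool:
        return self.bits[self._index(-lit)] == 1


def restrict(clause: List[int], alpha: ByteAssignment) -> Optional[List[int]]:
    """Compute C|alpha in one pass; None stands for top (clause satisfied)."""
    remaining = []
    for lit in clause:
        if alpha.is_true(lit):
            return None
        if not alpha.is_false(lit):
            remaining.append(lit)
    return remaining


if __name__ == "__main__":
    alpha = ByteAssignment(num_vars=3)
    alpha.set_true(-1)
    alpha.set_true(2)
    print(restrict([1, 3], alpha), restrict([2, 3], alpha), restrict([1], alpha))
```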

For proof steps requiring the full strength of PR clauses, Step 5 loops over all undeleted clauses in the formula. Formulas are represented as an array of clauses<sup>5</sup> together with a lazily updated list that tracks all indices of the array containing undeleted clauses. This enables both constant-time lookup of clauses throughout the algorithm and fast iteration over the undeleted clauses for Step 5. Deletion in the index list is done in (amortized) constant time by removing a deleted index only when the index is looked up in Step 5.1. Additionally, for each literal, the smallest clause index where that literal occurs (if any) is lazily tracked in a lookup array; for a given witness ω, all clauses occurring at indices below the index of any literal in ω can be skipped in Step 5.1.
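A lazily updated index list of this kind can be sketched as follows. The Python class below is a simplified illustration, not the CakeML data structure: the name `ClauseStore` and its methods are hypothetical, clause indices are assumed to be unique, and the per-literal smallest-index lookup array mentioned above is omitted.

```python
from typing import List, Optional, Tuple


class ClauseStore:
    """Array of clauses plus a lazily cleaned list of possibly-live indices."""

    def __init__(self) -> None:
        self.clauses: List[Optional[List[int]]] = []
        self.live: List[int] = []

    def add(self, index: int, clause: List[int]) -> None:
        if index >= len(self.clauses):
            # Grow the array so that position `index` exists.
            self.clauses.extend([None] * (index + 1 - len(self.clauses)))
        self.clauses[index] = clause
        self.live.append(index)

    def get(self, index: int) -> Optional[List[int]]:
        return self.clauses[index] if index < len(self.clauses) else None

    def delete(self, index: int) -> None:
        # Constant time: only the array slot is cleared; `live` is cleaned up later.
        if index < len(self.clauses):
            self.clauses[index] = None

    def live_clauses(self) -> List[Tuple[int, List[int]]]:
        """Iterate over undeleted clauses, dropping stale indices on the way."""
        kept: List[int] = []
        result: List[Tuple[int, List[int]]] = []
        for index in self.live:
            clause = self.clauses[index]
            if clause is not None:
                kept.append(index)
                result.append((index, clause))
        self.live = kept
        return result


if __name__ == "__main__":
    store = ClauseStore()
    store.add(1, [1, 2])
    store.add(2, [-1])
    store.add(3, [2, 3])
    store.delete(2)
    print(store.live_clauses(), store.live)
```

Each stale index is removed at most once, which is why the cost of deletion stays constant in the amortized sense.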

Proof checking exceptions. There are several steps in the proof checking algorithm that can fail (report NO) if the input proof is invalid, e.g., in Step 3.3. In a purely functional implementation, results are represented with an option: None indicating a failure and Some res indicating success with result res. While conceptually simple, this means that common case (successful) intermediate results are always boxed within an option and then immediately unboxed with pattern matching to be used again. In cake\_lpr, failures instead raise exceptions which are directly handled at the top level. Thus, successful results can be passed directly, i.e., as res, without any boxing. Support for verifying the use of exceptions is a unique feature of CakeML's CF framework [14].

Buffered I/O streams. Proof files generated by SAT solvers can be large, e.g., ranging from 300 MB to 4 GB for the second benchmark suite in Section 5. These files are streamed into memory line by line because each proof step depends only on information contained in its corresponding line in the file. This streaming interaction is optimized using CakeML's verified buffered I/O library [29] which maintains an internal buffer of yet-to-be-read bytes from the read-only proof file to batch and minimize the number of expensive filesystem I/O calls.
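The same streaming pattern can be sketched in Java with the standard `BufferedReader`. This is an analogy rather than CakeML's verified library: the class `ProofStreamSketch`, the 64 KiB buffer size, and the placeholder step counter are assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

/** Hypothetical sketch: process a proof one line at a time through a buffer. */
final class ProofStreamSketch {
    /** Returns the number of non-empty lines (proof steps) read from the source. */
    static int countSteps(Reader source) {
        int steps = 0;
        // The buffer batches many small reads into few calls on the underlying source.
        try (BufferedReader in = new BufferedReader(source, 1 << 16)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue;
                }
                // A real checker would parse and check this single step here;
                // no earlier line needs to be kept in memory.
                steps++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return steps;
    }
}
```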

<sup>5</sup> Deleted clauses are no longer referenced by the array and are automatically freed by CakeML's garbage collector.

#### 5 Benchmarks

This section compares the verified CakeML LPR proof checker against other verified checkers on two benchmark suites and a RAT microbenchmark. The first suite is a collection of problems with PR proofs generated by the *satisfaction-driven clause learning* (SDCL) solver SaDiCaL [20], while the second suite consists of unsatisfiable problems from the SAT Race 2019 competition.<sup>6</sup> The RAT microbenchmark consists of proofs for large mutilated chessboards generated by a BDD-based SAT solver [5]. The CakeML checker is labeled cake\_lpr (default 4GB heap and stack space), while the other checkers are labeled acl2-lrat (verified in ACL2 [15]), coq-lrat (verified in Coq [6]), and GRATchk (verified in Isabelle/HOL [28]), respectively. All experiments were run on identical nodes with Intel Xeon E5-2695 v3 CPUs (35M cache, 2.30GHz) and 128GB RAM. Configuration options specific to each benchmark suite are reported below.

#### 5.1 SaDiCaL **PR** Benchmarks

The SaDiCaL solver produces PR proofs for hard SAT problems in its benchmark suite [20] and it is experimentally much faster than a plain DRAT-based CDCL solver on those problems [20, Section 7]. The PR proofs are directly checked by cake\_lpr after conversion into LPR format with DPR-trim. For all other checkers, the PR proofs were first converted to DRAT format using pr2drat (as in the earlier approach [20]), and then into LRAT and GRAT formats using the DRAT-trim and GRATgen<sup>7</sup> tools respectively. All tools were run with a timeout of 10000 seconds and all timings are reported in seconds (to one decimal place). Results are summarized in Tables 2 and 3.

All benchmarks were successfully solved by SaDiCaL except mchess19 which exceeded the time limit. For the remaining benchmarks, generating and checking LPR proofs required a comparable (1–2.5x) amount of time to solving the problems, except mchess, for which LPR generation and checking is much faster than solving (Table 2). Unsurprisingly, direct checking of LPR proofs is *much faster* than the circuitous route of converting into DRAT and then into either LRAT or GRAT (Table 3). Unlike LPR, checking PR proofs via the LRAT route is 5–60x slower than solving those problems; this is a significant drawback to using the route in practice for certifying solver results.

The backwards compatibility of cake\_lpr is also shown in Table 3, where it is used to check the generated LRAT proofs. Among the LRAT checkers, acl2-lrat is fastest, followed by cake\_lpr (LRAT checking), and coq-lrat. Although cake\_lpr (LRAT checking) is on average 1.3x slower than acl2-lrat, it scales better on the mchess problems and is actually much faster than acl2-lrat on mchess18. We also observed that the GRAT toolchain (summing SaDiCaL, pr2drat, GRATgen and GRATchk times) is much slower than the LRAT toolchains (summing SaDiCaL, pr2drat, DRAT-trim and fastest LRAT checking times). This is in contrast to the SAT Race 2019 benchmarks below (Fig. 5), where we observed the opposite relationship. We believe that the difference in checking speed is due to the various checkers having different optimizations for checking the expensive RAT proof steps produced by conversion from PR proofs.

<sup>6</sup> The suites are available at http://fmv.jku.at/sadical/ and http://sat-race-2019.ciirc.cvut.cz/ respectively.

<sup>7</sup> GRATgen, the only tool that supports parallelism, was run with 8 threads.

Table 2. Timings for PR benchmarks with conversion into LPR format. The "Total (LPR)" column sums the generation and checking times. The timing for mchess19 is omitted because SaDiCaL timed out; timings for the Urquhart U.-s3-\* benchmarks are omitted because they took a negligible amount of time (< 1.0s total).

Table 3. Timings for PR benchmarks, first converted to DRAT and subsequently converted into LRAT and GRAT formats. The "Total (LRAT)" and "Total (GRAT)" columns sum the fastest generation and checking times for the LRAT and GRAT formats respectively. The "Total (LPR)" column (in bold, fastest total time) is reproduced from Table 2 for ease of comparison. Fail(T) indicates a timeout. Timings for the mchess19 and U.-s3-\* benchmarks are omitted as in Table 2.

#### 5.2 SAT Race 2019 Benchmarks

We further benchmarked the verified checkers on a suite of 117 unsatisfiable problems from the SAT Race 2019 competition. For all problems, DRAT proofs were generated using the state-of-the-art SAT solver CaDiCaL before conversion into the LRAT or GRAT formats. Notably, proofs generated by CaDiCaL on this suite rarely require RAT (or PR) steps, so the checkers are stress-tested on their implementation of file I/O, parsing, and Step 3.1 from Fig. 3; cake\_lpr is the *only* tool with a formally verified implementation of the former two steps. All tools were run with the SAT competition standard timeout of 5000 seconds.

Table 4. A summary of the SAT Race 2019 benchmark results. The N/A row counts problems that timed out or failed in an earlier step of the respective toolchains.

Fig. 5. (Top) Total SAT Race 2019 proofs checked within a given (per instance) time limit for the LRAT proof checkers. (Bottom) Total SAT Race 2019 proofs generated and checked within a given (per instance) time limit for the LRAT and GRAT toolchains.

A summary of the results is given in Table 4. All proofs generated by CaDiCaL were checked by at least one checker. The acl2-lrat checker fails with a parse error on one problem even though none of the other checkers reported such an error; GRATgen aborted on two problems for an unknown reason. Plots comparing LRAT proof checking time and overall proof generation and checking time (LRAT and GRAT) are shown in Fig. 5. From Fig. 5 (top), the relative order of LRAT checking speeds remains the same, where cake\_lpr is on average 1.2x slower than acl2-lrat, although cake\_lpr is faster on 28 benchmarks. From Fig. 5 (bottom), both LRAT toolchains are slower than the GRAT toolchain (average 3.5 times slower for cake\_lpr and 3.4 times for acl2-lrat). Part of the speedup for GRAT comes from GRATgen, which is the only tool that can be run in parallel (with 8 threads). This suggests that adding native support for GRAT-based input to cake\_lpr could be a worthwhile future extension.

mchess100 3599.0 9.3 44.2 Fail(T) Fail(T) 9506092 499

Table 5. Timings for the RAT microbenchmark. The number of proof steps and file size of the proofs (in MB) are shown in the last two columns. Fail(T) indicates a timeout.

#### 5.3 Mutilated Chessboard **RAT** Microbenchmarks

The final microbenchmark suite tests the LRAT checkers on large mutilated chessboard problem instances (up to 100 by 100) solved by pgbdd, a BDD-based SAT solver [5]. Unlike the previous two suites, LRAT proofs are emitted *directly* by the solver so additional DRAT-trim conversion is not needed. All tools were run with a timeout of 10000 seconds and all timings are reported in seconds (to one decimal place). For additional scaling comparison, we also report results for lrat-check, an *unverified* LRAT proof checker implemented in C.

The results in Table 5 show the impact of cake\_lpr's RAT optimizations (Section 4.2). Notably, cake\_lpr scales essentially *linearly* in the size of the proofs (up to ≈ 10 million proof steps). As a result, cake\_lpr is significantly faster than acl2-lrat and coq-lrat on these RAT-heavy proofs and it comes within a 5x factor of the unverified lrat-check tool.

#### 6 Related Work

*Verified Proof Checking.* There are several RAT-based verified proof checkers, in ACL2 [15], Coq [6], and Isabelle/HOL [28]. All three checkers are based on extensions of DRAT, which is itself an extension of the DRUP format [16]; the Coq checker is based on a predecessor for the GRIT [7] format. The ACL2 checker can be efficiently and *directly executed* (without extraction) using imperative primitives native to the ACL2 kernel [15]. However, the implementation of these features in ACL2 itself must be trusted to trust the proof checking results, hence the yellow background in Table 1. SMTCoq [2, 9] is another certificate-based checker for SAT and SMT problems in Coq. Its resolution-based proof certificates can be checked natively using native computation extensions of the Coq kernel.

*Applications.* SAT solving is a key technology underlying many software and hardware verification domains [4, 23]. Certifying SAT results adds a layer of trust and is clearly a worthwhile endeavor. Solver-aided mathematical results [17, 22, 26] are particularly interesting and challenging to certify because these often feature complicated SAT encodings, custom (hand-crafted) proof steps, and enormous resulting proofs [22]. Our cake\_lpr checker can handle the latter two challenges effectively. For the first challenge, the SAT encoding of mathematical problems can also be verified within proof assistants. This was demonstrated for the Boolean Pythagorean Triples problem building on the Coq proof checker [8].

*Verified SAT Solving.* An alternative to proof checking is to verify the SAT solvers [11, 12, 30, 33]. This is a significant undertaking but it would allow the pipeline of generating and checking proofs to be entirely bypassed. Furthermore, such verification efforts can yield new insights about key invariants underlying SAT solving techniques compared to prior pen-and-paper presentations, e.g., the 2WL invariant [12]. However, the performance of verified SAT solvers is not yet competitive with modern (unverified) SAT solving technology [11, 12].

#### 7 Conclusion

This work presents the new LPR proof format for verified checking of PR proofs. It demonstrates the feasibility of using binary code extraction to verify a performant LPR proof checker, cake\_lpr, down to its machine code implementation.

Given the strength of the PR proof system, there is ongoing research into the design of *satisfaction-driven clause learning* techniques [20, 21] for SAT solvers based on PR clauses. Our proof checker opens up the possibility of using a verified checker to help check and debug the implementation of these new techniques. It also gives future SAT competitions the option of providing PR as the default (verified) proof system for participating solvers.

Acknowledgments. We thank Jasmin Blanchette and the anonymous reviewers for their helpful feedback on earlier drafts of this paper, Peter Lammich for help with GRATgen, and Stefan O'Rear for help with profiling CakeML programs.

The first author was supported by A\*STAR, Singapore, the second author was supported by the National Science Foundation (NSF) under grant CCF-2010951, and the third author was supported by the Swedish Foundation for Strategic Research, Sweden. This work was also supported by NSF award number ACI-1445606 at the Pittsburgh Supercomputing Center (PSC).

#### A Correctness Theorem for cake\_lpr

The correctness theorem for cake\_lpr verified in HOL4 is shown in Fig. 6. The assumptions (1) (in red) are routine for compiled CakeML programs that use its basis library. The first line assumes that the command-line cl and file system fs models are well-formed. The second line assumes that the compiled code is correctly placed into (code) memory according to CakeML's x64 machine model.

Fig. 6. The end-to-end correctness theorem for the CakeML LPR proof checker.

The first guarantee (2) (in blue) is that the machine code implementation always terminates normally according to CakeML's x64 machine code semantics. In particular, the code never crashes and may emit some I/O events when run; however, it possibly terminates with an out-of-memory error (extend\_with\_resource\_limit) when CakeML runs out of stack or heap space.

The main correctness guarantee for cake\_lpr is (3) (in green) and (4) (in black). Briefly, (3) says that the only observable changes to the filesystem after executing cake\_lpr are strings printed on standard output out and standard error err. According to (3), if the string "s VERIFIED UNSAT\n" is printed onto standard output, then cake\_lpr was provided with a file (in its first command-line argument), and the file parses in DIMACS format to a formula fml which is unsatisfiable. The remaining else case (4) says that the only other possibilities for standard output are either (i) a printed version of the parsed DIMACS file (if no LPR proof file is provided), or (ii) the empty string. All other error messages are printed onto standard error.

In addition, the DIMACS parser (parse\_dimacs) is proved to be left inverse to the DIMACS printer (print\_dimacs) in the following sense:

```
wf_fml fml ⇒
  ∃ mv fml'.
    parse_dimacs (print_dimacs fml) = Some (mv, fml') ∧ interp fml = interp fml'
```
Briefly, this says that for any well-formed formula fml, printing that formula into DIMACS format then parsing it yields another formula fml' which is guaranteed to have the same interpretation according to the semantics of CNFs formalized in HOL4. All parsed formulas are well-formed (not shown here).
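The shape of this round-trip property can be illustrated with a small Java sketch. It is not the verified HOL4 parser or printer: the class `DimacsRoundTrip` is hypothetical, it assumes one clause per line terminated by `0`, and it checks syntactic equality of clauses, which is stronger than the equality of interpretations stated in the theorem.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a DIMACS printer and a matching simplified parser. */
final class DimacsRoundTrip {
    // Prints a header line followed by one zero-terminated clause per line.
    static String print(int numVars, List<int[]> clauses) {
        StringBuilder sb = new StringBuilder();
        sb.append("p cnf ").append(numVars).append(' ')
          .append(clauses.size()).append('\n');
        for (int[] clause : clauses) {
            for (int lit : clause) {
                sb.append(lit).append(' ');
            }
            sb.append("0\n");
        }
        return sb.toString();
    }

    // Simplifying assumption: every clause sits on its own line and ends with 0.
    static List<int[]> parse(String text) {
        List<int[]> clauses = new ArrayList<>();
        for (String raw : text.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("c") || line.startsWith("p")) {
                continue; // skip blank lines, comments, and the header
            }
            String[] tokens = line.split("\\s+");
            int[] clause = new int[tokens.length - 1]; // drop the terminating 0
            for (int i = 0; i < clause.length; i++) {
                clause[i] = Integer.parseInt(tokens[i]);
            }
            clauses.add(clause);
        }
        return clauses;
    }
}
```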

#### References



### Deductive Verification of Floating-Point Java Programs in KeY

Rosa Abbasi<sup>1</sup>, Jonas Schiffl<sup>2</sup>, Eva Darulova<sup>1</sup>, Mattias Ulbrich<sup>2</sup>, and Wolfgang Ahrendt<sup>3</sup>

<sup>1</sup> MPI-SWS, Kaiserslautern and Saarbrücken, Germany, {rosaabbasi,eva}@mpi-sws.org <sup>2</sup> Karlsruhe Institute of Technology, Karlsruhe, Germany, {jonas.schiffl,ulbrich}@kit.edu

<sup>3</sup> Chalmers University of Technology, Göteborg, Sweden, ahrendt@chalmers.se

Abstract. Deductive verification has been successful in verifying interesting properties of real-world programs. One notable gap is the limited support for floating-point reasoning. This is unfortunate, as floating-point arithmetic is particularly unintuitive to reason about due to rounding as well as the presence of the special values infinity and 'Not a Number' (NaN). In this paper, we present the first floating-point support in a deductive verification tool for the Java programming language. Our support in the KeY verifier handles arithmetic via floating-point decision procedures inside SMT solvers and transcendental functions via axiomatization. We evaluate this integration on new benchmarks, and show that this approach is powerful enough to prove the absence of floating-point special values—often a prerequisite for further reasoning about numerical computations—as well as certain functional properties for realistic benchmarks.

Keywords: Deductive Verification · Floating-point Arithmetic · Transcendental Functions.

#### 1 Introduction

Deductive verification has been successful in providing functional verification for programs written in popular programming languages such as Java [4, 23, 41, 49], Python [29], Rust [6], C [25, 54], and Ada [19, 50]. Deductive verifiers allow a user to annotate methods in a program with pre- and postconditions, from which they automatically generate verification conditions (VCs). These are then either proven directly by the verifier itself, or discharged with external tools such as automated (SMT) solvers or interactive proof assistants.

While deductive verifiers fully implement many sophisticated data representations (including heap data structures, objects, and ownership), support for floating-point numbers remains rather limited – solely Frama-C and SPARK offer automated support for floating-point arithmetic in C and Ada [32]. This state of affairs is at least partially a result of previous limitations in floating-point support in SMT solvers. Consequently, deductive verification has been used for floating-point programs only by experts with considerable manual effort [15, 32]. This is unfortunate as it makes deductive verification unavailable for a large number of programs across many domains including embedded systems, machine learning, and scientific computing. With the increasing need for parallelization in code, scientific computing specifically has recently experienced algorithmic challenges for which formal methods may contribute to a solution [10, 56].

© The Author(s) 2021

J. F. Groote and K. G. Larsen (Eds.): TACAS 2021, LNCS 12652, pp. 242–261, 2021. https://doi.org/10.1007/978-3-030-72013-1\_13

One of the main challenges of floating-point arithmetic is its unintuitive behavior and the special values that the IEEE 754 standard [39] introduces. For instance, an overflow or a division by zero results in the special value (positive or negative) *infinity*, and not a runtime exception. Similarly, invalid operations like sqrt(-1.0) result in a *Not a Number* (NaN) value. These special values are problematic as seemingly straightforward identities do not hold (x == x or x \* 0.0 == 0.0). In addition, every operation on floating-point numbers potentially involves rounding, which compromises familiar rules like associativity and distributivity. Hence, reasoning support for writing correct floating-point programs is indispensable.
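The failing identities mentioned above can be observed directly in Java. The following sketch uses a hypothetical helper class `SpecialValuesDemo`; the concrete inputs in the usage note are chosen for illustration only.

```java
/** Hypothetical helpers exposing identities that fail for floating-point values. */
final class SpecialValuesDemo {
    // Holds for every double except NaN.
    static boolean selfEqual(double x) {
        return x == x;
    }

    // Fails for NaN and for the infinities, since infinity * 0.0 is NaN.
    static boolean timesZeroIsZero(double x) {
        return x * 0.0 == 0.0;
    }

    // Rounding after each addition can make the two groupings differ.
    static boolean associative(double a, double b, double c) {
        return (a + b) + c == a + (b + c);
    }
}
```

For example, `selfEqual(Double.NaN)` and `timesZeroIsZero(Double.POSITIVE_INFINITY)` both evaluate to `false`, and `associative(1e16, -1e16, 1.0)` is `false` because `-1e16 + 1.0` rounds back to `-1e16`.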

Abstract interpretation-based tools can prove the absence of runtime errors and special values [20, 43], and bound roundoff errors due to floating-point's finite precision [11, 21, 26, 36, 57]. SMT decision procedures [18] or SAT-based model-checking [24,56], on the other hand, can prove intricate properties requiring bit-precise reasoning. However, these techniques and tools largely support only purely floating-point programs or program snippets, or analyze programs only up to a predefined depth of the call stack. General reasoning about real-world object-oriented programs, however, also requires support for features such as the (unbounded) heap, necessitating different analyses which need to be combined with floating-point reasoning.

Handling floating-points in a deductive verifier has unique advantages. First, the deductive verification approach already comes with the infrastructure for reasoning about complex control and data structures (like exception handling and heap). Second, it allows one to flexibly combine the verifier's symbolic execution reasoning with external decision procedures. Third, depending on the theory support, the verifier or external solver may also generate counterexamples of a property and thus help program debugging – something an abstract interpretation-based approach fundamentally cannot provide.

We report on adding floating-point support to the KeY deductive verifier, providing the first automated deductive floating-point support for the Java programming language. We focus mainly on proving the absence of the special values infinity and NaN. While these are helpful in certain circumstances, for most applications they signal an error. Hence, showing their absence is a prerequisite for further (functional) reasoning. That said, our extension also allows one to express and discharge arbitrary functional properties expressible in floating-point arithmetic, including bounds on roundoff errors for certain programs, and bounds on differences between two similar floating-point programs.

We exploit both KeY's symbolic execution and external SMT support. On the one hand, we handle arithmetic operations by relying on a combination of KeY's symbolic execution to handle the heap and SMT based decision procedures to handle the floating-point part of the VCs. On the other hand, we support transcendental functions via axiomatization in the KeY prover itself.

Transcendental functions such as sine are a common feature in numerical programs, but are not supported by floating-point decision procedures. We explore two ways of supporting them soundly but approximately, by encoding them as axiomatized uninterpreted function symbols once directly in the SMT queries, and once in additional calculus rules in KeY. Our evaluation shows that even though such reasoning is approximate, it is nonetheless sufficient to prove the absence of special values in many interesting programs.

We evaluate KeY's floating-point support on a number of real-world floating-point Java programs. Our benchmark set allows us to evaluate recent progress in SMT floating-point support in Z3 [28], CVC4 [8] and MathSAT [22] on yet unseen benchmarks. For instance, we observe that quantifiers are challenging even if they do not affect satisfiability of SMT queries. Our benchmarks are openly available, and we expect our insights to be useful for further solver development.

*Contributions* In summary, we make the following contributions:


#### 2 Background

#### 2.1 Introduction to KeY

KeY [4] is a platform for deductive verification of Java programs, working at a source code level. The input is a Java program annotated in the Java Modeling Language (JML) [45], encouraging a *Design by Contract* ([46, 51]) approach to software development. The user specifies the expected behavior of Java classes with *class invariants* that the program has to maintain at critical points. Methods are specified with *method contracts*, consisting mainly of pre- and postconditions, with the understanding that if the precondition holds when the method is called, the postcondition has to hold after the method returns.

After loading an annotated program, KeY translates it to a formula in Java Dynamic Logic [4] (JavaDL), an instance of Dynamic Logic [37] which enables logical reasoning about Java programs. Logical rules are provided for the translation of programs into first-order logic, and for closing the resulting *goals*, or proof obligations. KeY is semi-interactive in that it allows manual rule application, while also offering powerful built-in automation and macros. In addition, it is also possible to translate an open goal into SMT-LIB format [9] and call an external SMT solver. For specific theories, SMT solvers can be much more efficient than KeY's own automation. This makes it possible to prove some goals, which depend on SMT supported theories, by using an SMT solver, while others are proved internally, using KeY's own automation.

#### 2.2 Floating-Point Arithmetic in Java

In the following, we summarize some central characteristics of Java floating-point numbers, loosely following [53]. Each *normal* floating-point number x can be represented as a triplet (s, m, e), such that $x = (-1)^s \cdot m \cdot 2^e$, where $s \in \{0, 1\}$ is the *sign*, m (called *significand*) is a binary fixed-point number with one digit before the radix point and p − 1 digits after the radix point (note that 0 ≤ m < 2), and e (*exponent*) is an integer such that $e_{min} \le e \le e_{max}$. Java supports two floating-point formats (both in base 2): float ('single') precision with p = 24, $e_{min} = -126$, and $e_{max} = 127$, and double precision with p = 53, $e_{min} = -1022$, and $e_{max} = 1023$.
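To make the triplet concrete, the following Java sketch extracts (s, m, e) from a normal `float` using the standard `Float.floatToIntBits` method. The class `FloatTriple` and its representation of m by its 23 fraction bits are assumptions made for illustration.

```java
/** Hypothetical sketch: decompose a normal float into sign, significand, exponent. */
final class FloatTriple {
    /**
     * Returns {s, fraction, e}, where the significand is m = 1 + fraction / 2^23.
     * Throws for zeros, subnormals, infinities, and NaN, which are not normal numbers.
     */
    static int[] decompose(float x) {
        int bits = Float.floatToIntBits(x);
        int s = bits >>> 31;                  // 1 sign bit
        int biased = (bits >>> 23) & 0xFF;    // 8 exponent bits, stored with bias 127
        int fraction = bits & 0x7FFFFF;       // p - 1 = 23 fraction bits
        if (biased == 0 || biased == 0xFF) {
            throw new IllegalArgumentException("not a normal number");
        }
        return new int[]{s, fraction, biased - 127};
    }

    /** Recomputes (-1)^s * m * 2^e from the triple. */
    static double value(int[] t) {
        double m = 1.0 + t[1] / (double) (1 << 23);
        double magnitude = Math.scalb(m, t[2]); // exact scaling by a power of two
        return t[0] == 0 ? magnitude : -magnitude;
    }
}
```

For instance, -6.5f is 1.625 · 2², so `decompose(-6.5f)` yields s = 1, e = 2, and a fraction of 0.625 · 2²³.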

Whenever the result of a computation cannot be exactly represented with the given precision, it is rounded. IEEE 754 defines various rounding modes, of which Java only supports *round to nearest, ties to even*. Rounding is exact, as if one would first compute the ideal real number, and round afterwards.

The triple representation gives us two zeros, +0 and −0, represented by (0, 0, 0) and (1, 0, 0), respectively. If the absolute value of the ideal result of a computation is too small to be representable as a floating-point number of the given format, the resulting floating-point number is +0 or −0. In addition, there are three special values, +∞, −∞, and NaN (Not a Number). If the absolute value of the ideal result of a computation is too big to be representable as a floating-point number of the given format, the result is +∞ or −∞. Also, division by zero will give an infinite result (e.g., 7.13/+0 = +∞). Computing further with infinity may give an infinite result (e.g., +∞ + +∞ = +∞), but may also result in the additional 'error value' NaN (e.g., +∞ − +∞ = NaN). Due to the presence of infinities and NaN, floating-point operations do *not* throw Java exceptions.
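A short Java sketch shows where these values come from in practice, and contrasts them with integer division, which does throw. The class `SpecialValueOrigins` and its helper names are hypothetical.

```java
/** Hypothetical helpers showing how zeros, infinities, and NaN arise. */
final class SpecialValueOrigins {
    static double divide(double a, double b) {
        return a / b; // never throws, even for b == 0.0
    }

    // -0.0 == 0.0 under IEEE comparison, so the sign is detected via division.
    static boolean isNegativeZero(double x) {
        return x == 0.0 && 1.0 / x == Double.NEGATIVE_INFINITY;
    }

    // Integer division by zero, in contrast, raises ArithmeticException.
    static boolean intDivisionThrows(int a, int b) {
        try {
            int unused = a / b;
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }
}
```

With these helpers, `divide(7.13, 0.0)` is positive infinity, `Double.MAX_VALUE * 2.0` overflows to positive infinity, and halving the negated smallest positive double, `-Double.MIN_VALUE / 2.0`, rounds to −0.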

By default, the Java virtual machine is allowed to make use of higher-precision formats provided by the hardware. This can make computation more accurate, but it also leads to platform dependent behaviour. This can be avoided by using the strictfp modifier, ensuring that only the single and double precision types are used. This modifier ensures portability.

#### 3 Floating-Point Support in KeY

#### 3.1 Arithmetics

In order to be able to specify and verify programs containing floating-point numbers, we made several extensions to the KeY tool. First, we added the float and double types to the KeY type system, together with an enum type for the different rounding modes of the IEEE 754 Standard.

Listing 1.1: The Rectangle.scale benchmark

```
/*@ public normal_behavior
  @ requires \fp_nice(arg0.x) && \fp_nice(arg0.y)
  @ && \fp_nice(arg1) && \fp_nice(arg2);
  @ ensures !\fp_nan(\result.x) && !\fp_nan(\result.y) &&
  @ !\fp_nan(\result.width) && !\fp_nan(\result.height);
  @ also
  @ public normal_behavior
  @ requires -5.53 <= arg0.x && arg0.x <= -3.38 &&
  @ -5.53 <= arg0.y && arg0.y <= -3.38 &&
  @ 3.1 < arg0.width && arg0.width <= 3.7332 &&
  @ 3.0000001 < arg0.height && arg0.height <= 4.0004 &&
  @ 3.0003001 < arg1 && arg1 <= 4.0024 &&
  @ -6.4000003 < arg2 && arg2 <= 3.0001;
  @ ensures !\fp_nan(\result.x) && !\fp_nan(\result.y) &&
  @ !\fp_nan(\result.width) && !\fp_nan(\result.height);
  @*/
public Rectangle scale(Rectangle arg0, double arg1, double arg2){
 Area v1 = new Area(arg0);
 AffineTransform v2 = AffineTransform.getScaleInstance(arg1, arg2);
 Area v3 = v1.createTransformedArea(v2);
 Rectangle v4 = v3.getRectangle2D();
 return v4;
}
```

We further introduced functions and predicate symbols to formalize operations (+, \*, . . . ) and comparisons (<, ==, . . . ) on floating-point expressions. The translation supports both code with and without the strictfp modifier. However, since the actual precision of non-strictfp operations is not known, the function symbols remain uninterpreted. We extended KeY's parser to correctly handle programs and annotations containing floating-point numbers, and added logic rules for translating floating-point expressions from Java or JML to JavaDL.

As an example, Listing 1.1 shows the JML specification of our Rectangle benchmark, which contains floating-point literals and makes use of the fp\_nan and fp\_nice predicates. fp\_nan states that a floating-point expression is NaN, and fp\_nice states that a floating-point expression is neither NaN nor infinity. The scale method has two contracts that are checked separately; each ensures that the class fields of a scaled rectangle object are not NaN, under different preconditions. For the first contract, the SMT solver produces a counterexample. In the second, we bound the inputs by concrete ranges that we picked arbitrarily, and the contract is proved valid. In practice, such ranges would come from the context, e.g. from the kind of rectangles that appear in an application, or from known ranges of sensor values.
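To see why nice inputs alone need not rule out NaN results, consider the following Java sketch. The predicates mirror fp\_nan and fp\_nice using the standard `Double.isNaN` and `Double.isInfinite` methods; the method `scaledWidth` is a hypothetical stand-in for a scaling computation and is not the code of the Rectangle benchmark, and the inputs in the usage note are not the counterexample reported by the solver.

```java
/** Hypothetical Java counterparts of the fp_nan and fp_nice predicates. */
final class NicePredicates {
    static boolean fpNan(double x) {
        return Double.isNaN(x);
    }

    static boolean fpInfinite(double x) {
        return Double.isInfinite(x);
    }

    static boolean fpNice(double x) {
        return !fpNan(x) && !fpInfinite(x);
    }

    // Illustrative stand-in: width of a scaled interval, computed as the
    // difference of its scaled end points.
    static double scaledWidth(double x, double width, double factor) {
        return (x + width) * factor - x * factor;
    }
}
```

With the nice inputs x = 1e308, width = 0.0, and factor = 10.0, both products overflow to infinity and their difference is NaN, whereas inputs taken from the ranges of the second contract stay far from overflow.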

Concerning discharging the resulting proof obligations, there were two main ways to consider. One is to create a floating-point theory within KeY by adding axioms and deduction rules, so that the desired properties can be proven in KeY's sequent calculus. The other way is to translate the proof obligations from JavaDL to SMT-LIB and call an external SMT solver. While the KeY approach traditionally favors conducting proofs within KeY, for this work, we partially deviated from this way in order to harness the greater experience and efficiency of SMT solvers when it comes to floating-point arithmetic. Our approach attempts to get the best of both worlds by distinguishing between basic floating-point arithmetic, i.e., elementary operations and comparisons, and more complex functions which do not have an SMT-LIB equivalent (e.g., the transcendental functions), or where the SMT-LIB function is not usefully implemented by current SMT solvers (see Section 3.2.B).

Elementary operations and comparisons get translated to the corresponding SMT-LIB functions. In SMT-LIB, all floating-point computations conform to the IEEE 754 Standard. Therefore, only Java programs with the strictfp modifier can be directly translated to SMT-LIB without loss of correctness.

We developed a translation from KeY's floating-point theory to SMT-LIB. In order to integrate it into KeY, we also overhauled the existing translation from JavaDL to SMT-LIB to create a new, more modular framework, which now supports all the features of the original translation, e.g., heaps and integer arithmetic, but also floating-point expressions at the same time.

Floating-point intricacies sometimes require extra caution. For example, there are two different notions of equality for floats: bitwise equality and IEEE 754 equality. Our implementation ensures these are distinguished correctly, and that the specification language remains intuitive for a developer to use.
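The two notions of equality already differ in plain Java, as the following sketch shows. The class `FloatEquality` is hypothetical and says nothing about how KeY encodes the distinction; note also that `Double.doubleToLongBits` maps all NaN values to one canonical bit pattern, so this "bitwise" comparison treats any two NaNs as equal.

```java
/** Hypothetical sketch of IEEE 754 equality versus bitwise equality. */
final class FloatEquality {
    // IEEE 754 equality: NaN is unequal to everything, +0 equals -0.
    static boolean ieeeEquals(double a, double b) {
        return a == b;
    }

    // Bitwise equality on the (NaN-canonicalized) representation.
    static boolean bitwiseEquals(double a, double b) {
        return Double.doubleToLongBits(a) == Double.doubleToLongBits(b);
    }
}
```

The two disagree exactly on the corner cases: NaN is bitwise equal to itself but not IEEE equal, while +0 and −0 are IEEE equal but not bitwise equal.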

Using the translation to SMT-LIB, we can specify and prove two classes of properties in KeY: The absence of special values is specified using the fp\_nan and fp\_infinite predicates (or the fp\_nice equivalent). Furthermore, one can specify *functional* properties that are expressible in floating-point arithmetic, e.g. one can compare the result of a computation against the result of a different program which is known to produce a good result or a reference value.

#### 3.2 Transcendental Functions

Floating-point decision procedures in SMT solvers successfully handle programs consisting of arithmetic and square root operations. Many numerical real-world programs, however, include transcendental functions such as sin and cos. In Java programs, these functions are implemented as static library functions in the class java.lang.Math.

Unlike arithmetic operations, transcendental functions are much more loosely specified by the IEEE 754 Standard—only an upper bound on the roundoff error is given. Libraries are thus free to provide different implementations, and even tighter error bounds. Exact reasoning in the same spirit as floating-point arithmetic would thus have to encode a specific implementation. Given that these implementations are highly optimized, this approach would be arguably complex. We observe, however, that such exact reasoning about transcendental functions is often not necessary and a sound approximate approach is sufficient and efficient.

In this section, we introduce an axiomatic approach for reasoning about programs containing transcendental functions. We observe that with the flexibility of deductive verification and KeY itself, we can instantiate it in two different ways. We encode transcendental functions as uninterpreted functions and axiomatize them in the SMT queries. Alternatively, we encode these axioms in KeY as logical inference rules.

(A) Axiomatization in SMT We encode library functions as uninterpreted functions and include a set of axioms in the SMT-LIB translation for each method that is called in a benchmark. That is, we extended KeY such that when a transcendental function occurs in the proof obligation, its definition, together with all the axioms for that function, is added to the translation.

For the axiomatization of transcendentals, we did *not* add rules that expand to a definition or allow a repeated approximation of the function value (like expansion into a Taylor series). Instead, we added a number of lemmata encoding interesting properties related to special values. For instance, the following axiom states that if the input to the sin function is not a NaN or infinity, then the returned value of sin is between −1.0 and 1.0:

```
(assert (forall ((a Float64)) (=>
  (and (not (fp.isNaN a)) (not (fp.isInfinite a)))
  (and (fp.leq (sinDouble a) (fp #b0 #b01111111111 #b0000...000000))
       (fp.geq (sinDouble a) (fp #b1 #b01111111111 #b0000...000000))))))
```
Note that this implies that the result is not a NaN or infinity. The other axioms are similar in spirit, so we do not list them.
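To make the axiom concrete, the following sketch evaluates the same implication for individual Java doubles. It is only an illustration under stated assumptions: the class and method names are hypothetical, `Math.sin` stands in for the uninterpreted `sinDouble`, and the elided significand bits in the axiom are assumed to be all zero, which matches the bounds 1.0 and −1.0 named in the text.

```java
// Sketch: the sin axiom, checked pointwise on Java doubles.
// SinAxiom and holdsFor are hypothetical names.
final class SinAxiom {
    // Bit patterns from the axiom: sign bit, exponent 01111111111,
    // significand assumed to be all zeros.
    static final double UPPER = Double.longBitsToDouble(0x3FF0000000000000L); // 1.0
    static final double LOWER = Double.longBitsToDouble(0xBFF0000000000000L); // -1.0

    // Evaluates "premise implies conclusion" for one concrete input.
    static boolean holdsFor(double a) {
        boolean premise = !Double.isNaN(a) && !Double.isInfinite(a);
        double s = Math.sin(a);
        // Both comparisons are false if s is NaN, so a true conclusion
        // also rules out NaN and infinity for the result.
        boolean conclusion = s <= UPPER && s >= LOWER;
        return !premise || conclusion;
    }

    public static void main(String[] args) {
        System.out.println(UPPER + " " + LOWER);                 // 1.0 -1.0
        System.out.println(holdsFor(1.0e300));                   // true
        System.out.println(holdsFor(Double.NaN));                // true (premise is false)
        System.out.println(holdsFor(Double.POSITIVE_INFINITY));  // true (premise is false)
    }
}
```

Testing individual inputs is of course much weaker than the universally quantified axiom; the sketch only shows how the formula is read.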

These axioms are expressed as quantified floating-point formulas and capture high-level properties of library functions complying with the specifications in the IEEE 754 Standard. Clearly, since we do not have the actual implementations of these functions, we are not able to prove arbitrary properties. However, such an axiomatization is often sufficient to check for the (absence of) special values, i.e. NaN and infinity, as our experiments in Section 4.4 show.

(B) Taclets in KeY Reasoning about quantified formulas in SMT is a long-lasting challenge [34]. We have also observed in our experiments with only arithmetic operations (Section 4.3) that SMT solvers struggle with quantifiers in combination with floating-points. We have therefore implemented an alternative approach encoding the axioms not in the SMT queries, but instead as deductive inference rules (so-called taclets) in KeY.

The rules encode the same logical information as the universally quantified assertions that we add in SMT-LIB (and where we leave the choice of instantiations entirely to the SMT/SAT solver). With our taclet approach, we instantiate a quantifier only as needed. We note that for proving a property correct, this results in a correct (under)approximation. However, the price for achieving more closed proofs and shorter running times is that for disproving a property, not considering all possible quantifier instantiations may lead to spurious counterexamples, i.e., false positives.

Table 1: Benchmark details and KeY automode statistics, time is measured in seconds

A heuristic strategy applies the rules automatically using the occurrences of transcendentals as instantiation triggers. However, instantiating the axioms too eagerly considerably increases the number of open goals, which is why we assume that the user selects the axioms to apply manually (and did so in the experiments). After the application, the proof obligation can either be closed, i.e., proven, by KeY automatically, or be given to the SMT solver as before for final solving.

Currently, the set of axioms (in the SMT-LIB translation and as taclets in KeY) only contains axioms for the transcendental functions occurring in our benchmarks. So far we have 10 axioms; however, adding more axioms (also for further transcendentals like exponentiation or logarithm) is straightforward. The full set of axioms is included in the Appendix of the technical report [3].

#### 4 Evaluation

#### 4.1 Benchmark Programs

We collected a set of existing floating-point Java programs representing real-world applications in order to evaluate the feasibility and performance of KeY's floating-point support.

The left half of Table 1 provides an overview of our benchmarks. Each benchmark consists of one method, which is composed of arithmetic operations and method calls, potentially to other classes. The invocations of methods from java.lang.Math (e.g. Math.abs) are marked by "+1" in Table 1; these are resolved by inlining the method implementation. For benchmarks that contain calls to transcendental functions and square root, the called functions are listed; these are handled by our axiomatization. We include sqrt in this list, as we have observed that exact support can be expensive, so it may be advantageous to handle sqrt axiomatically. Benchmarks Rectangle, Circuit, Matrix3 and Rotation are partially shown in Listings 1.1, 1.2, 1.3 and 1.4 respectively.

#### Listing 1.2: The Circuit.instantCurrent benchmark

```
public class Circuit {
  double maxVoltage, frequency, resistance, inductance;
  // ...
  /*@ public normal_behavior
    @ requires 1.0 < this.maxVoltage && this.maxVoltage < 12.0 &&
    @ 1.0 < this.frequency && this.frequency < 100.0 &&
    @ 1.0 < this.resistance && this.resistance < 50.0 &&
    @ 0.001 < this.inductance && this.inductance < 0.004 &&
    @ 0.0 < time && time < 300.0;
    @ ensures !\fp_nan(\result) && !\fp_infinite(\result);
    @*/
  public double instantCurrent(double time) {
    Complex current = computeCurrent();
    double maxCurrent = Math.sqrt(current.getRealPart() * current.getRealPart() +
      current.getImaginaryPart() * current.getImaginaryPart());
    double theta = Math.atan(current.getImaginaryPart() / current.getRealPart());
    return maxCurrent * Math.cos((2.0 * Math.PI * frequency * time) + theta);
  }
}
```

Each benchmark also includes a JML contract that is to be checked. For some methods, we specify two contracts (marked by "(2)" in the first column of Table 1), each serving as an independent benchmark. The contracts for most of these benchmarks check that the methods do not return a special value, i.e., infinity and/or NaN, the preconditions being that the variables are not themselves special values and possibly are bounded in a given range. For the Matrix, FPLoop and Rotate benchmarks, we check a *functional* property (see Section 4.3). FPLoop, which has three contracts, additionally shows how to specify floating-point loop behavior using loop invariants.

#### 4.2 Proof Obligation Generation

To reason about the contract of a selected benchmark, we apply KeY, which generates proof obligations or 'goals'. Some of these goals (heap-related) are closed by KeY automatically. The remaining open goals are closed by either SMT solvers with floating-point support directly (Section 3.1 and Section 3.2.A), or with a combination of transcendental KeY taclets and floating-point SMT solving (Section 3.2.B).

Columns 6 and 7 in Table 1 show the number of proof obligations closed by KeY directly and to be discharged by external solvers, respectively. The next two columns show the number of taclet rules that KeY applied in order to close its goals, and the time this takes. For benchmarks with two contracts we show the respective values separated by '/'.

We ran our experiments on a server with 1.5 TB memory and 4x12 CPU cores at 3 GHz. However, KeY runs single-threaded and does not use more than 8 GB of memory.

For our set of benchmarks, the symbolic execution process is fully automated. Note that the machinery can deal with loop invariants, if they are provided. Loop invariant generation is, however, particularly challenging for floating-points due to roundoff errors [27, 40], and a research topic in itself.

#### 4.3 Evaluation of SMT Floating-Point Support

Previous work [32] reported that SMT support for floating-point arithmetic is rather limited. However, with recent advances [18], we evaluate the situation again. Most benchmarks used to evaluate SMT solvers' decision procedures [1] aim to check (individual) specialized (corner case) properties of floating-point arithmetic. The proof obligations generated from our set of benchmarks are complementary in that they are more arithmetic heavy, while nonetheless relying on accurate reasoning about special values and functional properties.

For each open goal not automatically closed, KeY generates one SMT-LIB file that is fed to the solvers for validation. We compare the performance of the three major SMT solvers with floating-point support: CVC4 [8] (version 1.8, with the SymFPU library [18] enabled), Z3 (4.8.9) [28] and MathSAT (5.6.3) [22]. For this we set a timeout of 300s for each proof obligation. While KeY is able to discharge proof obligations in parallel, for our experiments, we do so sequentially to maintain comparability.

KeY's default translation to SMT includes quantifiers. These quantifications are not related to floating-point arithmetic, but are used to logically encode important properties of the Java memory model, like the type hierarchy and the absence of dangling references on any valid Java heap. If we reason about floating-point problems in isolation, they are not needed, but if we want to consider Java verification more holistically with questions combining aspects of heap and floating point reasoning, they become essential. We manually inspected that the proof obligations without our axiomatized treatment of transcendental functions do not depend on these properties and investigate the quantifier support by including or removing them from the SMT translations. We do not report results with quantifiers for MathSAT, since it does not support them.

Table 2: Summary of valid / invalid goals correctly decided and average running times of each solver for the SMT translations with and without quantified axioms

Fig. 1: Runtimes for valid goals with SMT translations *with* quantifiers

Fig. 2: Runtimes for valid goals with SMT translations *without* quantifiers

Table 2 summarizes the results of our experiments. Column 4 shows the number of expected valid or invalid goals for all benchmarks. For each solver we show the number of goals that each solver can validate or invalidate, together with the average time (in seconds) needed. The goals resulting in timeout were excluded from the computation of the average time. Column 3 shows whether the SMT queries include quantifiers or not.
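The averaging convention can be sketched as follows. This is an illustration only: the class and method names are hypothetical, encoding a timeout as a runtime at or above the limit is an assumption, and the sample values in `main` are made up rather than taken from Table 2.

```java
// Sketch: average solver runtime with timed-out goals excluded.
// AverageRuntime and averageExcludingTimeouts are hypothetical names.
final class AverageRuntime {
    // Per-goal timeout used in the experiments (seconds).
    static final double TIMEOUT_SECONDS = 300.0;

    // Assumed encoding: a runtime at or above the limit denotes a timeout.
    static double averageExcludingTimeouts(double[] runtimes) {
        double sum = 0.0;
        int solved = 0;
        for (double t : runtimes) {
            if (t < TIMEOUT_SECONDS) {
                sum += t;
                solved++;
            }
        }
        // No solved goal means there is no meaningful average.
        return solved == 0 ? Double.NaN : sum / solved;
    }

    public static void main(String[] args) {
        // Made-up sample: three solved goals and one timeout.
        double[] sample = {1.0, 3.0, 300.0, 2.0};
        System.out.println(averageExcludingTimeouts(sample)); // 2.0
    }
}
```

A consequence of this convention is that a solver which times out on the hard goals can show a lower average than one which solves them, so the averages should be read together with the number of goals decided.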

Rows 1 and 2 of Table 2 show the results for benchmarks with valid contracts. This experiment thus represents the common behavior of KeY, whose main goal is to *prove* contracts correct. Rows 3 and 4 of Table 2 demonstrate the results for benchmarks with invalid contracts, i.e. for those we expect a counterexample for at least one of the goals. The Appendix of the technical report [3] contains the detailed results for each experiment separated by benchmark. Figure 1 and Figure 2 show a more detailed view of the solvers' running time for the valid benchmarks. The x-axis shows the number of open goals that are discharged by the SMT solvers, sorted by running time for each solver individually. The k-th point of one graph shows the minimum running time needed by the solver to close each of the k fastest goals. Note that each solver may have different goals which are its k fastest. The y-axis shows the time on a logarithmic scale.
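Put differently, each curve is obtained by sorting one solver's runtimes in ascending order. A minimal sketch of that construction, with hypothetical names and made-up sample values:

```java
// Sketch: computing the points of one runtime curve for a single solver.
// RuntimePlot and curve are hypothetical names.
final class RuntimePlot {
    // Entry k-1 of the result is the runtime of the solver's k-th fastest goal,
    // i.e. the time within which each of its k fastest goals is closed.
    static double[] curve(double[] runtimes) {
        double[] sorted = runtimes.clone(); // leave the input untouched
        java.util.Arrays.sort(sorted);
        return sorted;
    }

    public static void main(String[] args) {
        // Made-up runtimes in seconds, not measurements from the paper.
        double[] points = curve(new double[]{3.0, 0.5, 1.2});
        System.out.println(java.util.Arrays.toString(points)); // [0.5, 1.2, 3.0]
    }
}
```

Because every solver is sorted independently, the same x-position on two curves generally refers to different goals.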

We conclude that in the presence of quantified axioms and floating-point arithmetic, solvers' performance deteriorates for both valid and invalid goals. In particular, none of the solvers is able to find counterexamples for any of the invalid goals. However, when the quantified axioms are removed from the SMT translations, their performance improves. For valid contracts, CVC4 and MathSAT perform better than Z3, in terms of both number of goals validated and the running time per goal. In particular, MathSAT is able to prove all goals. However, the running time performance of CVC4 is better than MathSAT's. For invalid contracts, solvers are able to produce the expected counterexamples at least partially. Particularly, MathSAT has a better performance than CVC4 and Z3 in terms of both running time and the number of proof obligations for which it can produce counterexamples.

We conducted another experiment on our Rectangle.scale benchmark to assess the solvers' sensitivity to various changes applied to the benchmark's contract or its implementation. We considered modifications such as reducing the number of classes while keeping the same functionality, having tighter and larger bounds for variables, reducing the number of arithmetic operations, etc. The details of this experiment can be found in the Appendix of the technical report [3]. In summary, solvers' performance seems to be sensitive to slight, innocuous-looking changes such as the number of classes involved and variable bounds. For example, constraining arg2 in the original benchmark more tightly allows CVC4 to validate all goals (1 more). This behavior could be potentially exploited by e.g. relaxing a variable's bounds.

*Proving Functional Properties* Listings 1.3 and 1.4 show examples of functional properties that are expressible in floating-point arithmetic and that KeY can handle. The verification results are included in rows 1 and 2 of Table 2, for more details see the Appendix of the technical report [3].

For Matrix, we check that the determinants of a matrix and its transpose are equal. Note that this property holds trivially under real arithmetic, but not necessarily under floating-points. After feeding transposedEq (which uses the determinant method) and its contract to KeY, increasing the default timeout sufficiently and discharging the created goal, CVC4 generates a counterexample in 170.2s and MathSAT in 16.2s. Z3 times out after 30 minutes. By feeding transposedEqV2 (which uses the determinantNew method) to KeY, CVC4 validates the contract in 1.1s, MathSAT in 3.9s and Z3 times out again. One thing worth noting is that the way programs are written can greatly influence the computational complexity needed to reject or verify the contract. This is evident from the fact that slightly modifying the order of operations (using determinantNew instead) substantially reduces verification time and changes the verification result for MathSAT and CVC4.
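The underlying reason is that floating-point operations are not associative, so reordering the sums and products of a determinant (as transposing does) can change the rounded result. The sketch below shows the effect on a plain three-term sum; it is not the counterexample found by the solvers, and the names are hypothetical.

```java
// Sketch: evaluation order changes the rounded result of a double sum.
// OrderMatters, leftToRight and rightToLeft are hypothetical names.
final class OrderMatters {
    static double leftToRight(double x, double y, double z) {
        return (x + y) + z;
    }

    static double rightToLeft(double x, double y, double z) {
        return x + (y + z);
    }

    public static void main(String[] args) {
        System.out.println(leftToRight(0.1, 0.2, 0.3)); // 0.6000000000000001
        System.out.println(rightToLeft(0.1, 0.2, 0.3)); // 0.6
    }
}
```

Both expressions denote the same real number, yet the two doubles differ in the last bit, which is exactly the kind of discrepancy a contract with `==` on floating-point results exposes.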

For Rotate, we check that the difference between an original vector and the one that is rotated four times by 90 degrees, must not be larger than 1.0E-15. We also verified the same bound for the relative difference (by exploiting another method and contract) for this benchmark. The constant cos90 in Listing 1.4 is not precisely 0.0 to account for rounding effects in the computation of the cosine. FPLoop includes three loops, for which the contracts check that the return value is bigger than a given constant.

Though not always very fast, these examples show that verification of functional floating-point properties is viable.

#### Listing 1.3: The Matrix3 benchmark

```
public class Matrix3 {
  double a, b, c, d, e, f, g, h, i; //The matrix: [[a b c],[d e f],[g h i]]
  double det;
  // method transpose not shown
  double determinant() {
    return (a * e * i+b * f * g+c * d * h) -
      (c * e * g+b * d * i+a * f * h);
  }
  double determinantNew() {
    return (a * (e * i) + (g * (b * f) + c * (d * h))) -
      (e * (c * g) + (i * (b * d) + a * (f * h)));
  }
  /*@ ensures \fp_normal(\result) ==> (\result == det); @*/
  double transposedEq() {
    det = determinant();
    return transpose().determinant();
  }
  /*@ ensures \fp_normal(\result) ==> (\result == det); @*/
  double transposedEqV2() {
    det = determinantNew();
    return transpose().determinantNew();
  }
}
```

#### Listing 1.4: The Rotation benchmark

```
public class Rotation {
  final static double cos90 = 6.123233995736766E-17;
  final static double sin90 = 1.0;
  // rotates a 2D vector by 90 degrees
  public static double[] rotate(double[] vec) {
    double x = vec[0] * cos90 - vec[1] * sin90;
    double y = vec[0] * sin90 + vec[1] * cos90;
    return new double[]{x, y};
  }
  /*@ requires (\forall int i; 0 <= i && i < vec.length;
    @ \fp_nice(vec[i]) && vec[i] > 1.0 && vec[i] < 2.0) && vec.length == 2;
    @ ensures \result[0] < 1.0E-15 && \result[1] < 1.0E-15;
    */
  public static double[] computeError(double[] vec) {
    double[] temp = rotate(rotate(rotate(rotate(vec))));
    return new double[]{Math.abs(temp[0] - vec[0]), Math.abs(temp[1] - vec[1])};
  }
}
```
#### 4.4 Evaluation of Support for Transcendental Functions in KeY

We evaluated the two approaches from Section 3.2 on our set of benchmarks; rows 5 and 6 in Table 2 summarize the results. (The detailed results of these experiments are included in the Appendix of the technical report [3].) Note that both approaches are fully automated.

We conclude that the SMT solvers perform better when the axiomatization is applied at the KeY level. When axioms for transcendental functions are added to the SMT-LIB translation directly, Z3 validates 4 out of 10 goals. With the axiomatization at the KeY level, solvers are able to validate more goals (with quantified formulas removed from the SMT translations), e.g. Z3 is able to validate 5 goals and CVC4 can validate all. Therefore, it is preferable to apply the axioms on the KeY side via taclet rules.

All the solvers we have used in this work comply with the IEEE 754 standard and therefore have bit-precise support for the square root function. They provide bit-precise reasoning by effectively encoding the behavior of floating-point circuits over bitvectors (which is naturally expensive), together with different heuristics and abstractions to speed up solving time. However, depending on the property, we do not always need bit-precise reasoning, so we propose handling the square root function with the same taclet-based axiomatization as introduced in Section 3.2.B.

To this end, we conducted an experiment on the benchmarks containing sqrt, comparing the approach from Section 3.2.B (adding the necessary axioms, resp. taclet rules) to using the square root implemented in SMT solvers (fp.sqrt). We chose to include only axioms specified in or inferred from the IEEE 754 standard (e.g. if the argument of the square root function is NaN or less than zero, then the square root results in NaN). The full set of axioms that we used is included in the Appendix of the technical report [3].
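The example axiom mentioned above can be read as an implication and checked pointwise against Java's `Math.sqrt`. This is only a sketch of that one axiom with hypothetical names; the full axiom set is the one in the technical report, and `Math.sqrt` merely stands in for the axiomatized function.

```java
// Sketch: "NaN or negative argument implies NaN result", checked pointwise.
// SqrtAxioms and nanAxiomHoldsFor are hypothetical names.
final class SqrtAxioms {
    static boolean nanAxiomHoldsFor(double a) {
        // -0.0 < 0.0 is false, so negative zero does not satisfy the premise.
        boolean premise = Double.isNaN(a) || a < 0.0;
        return !premise || Double.isNaN(Math.sqrt(a));
    }

    public static void main(String[] args) {
        System.out.println(nanAxiomHoldsFor(-1.0));                     // true
        System.out.println(nanAxiomHoldsFor(Double.NaN));               // true
        System.out.println(nanAxiomHoldsFor(Double.NEGATIVE_INFINITY)); // true
        System.out.println(Math.sqrt(-0.0));                            // -0.0
    }
}
```

The negative-zero case shows why "less than zero" has to be read under IEEE 754 comparison: the square root of −0.0 is −0.0, not NaN.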

Rows 7 and 8 in Table 2 summarize the results for this experiment; the detailed results are included in the Appendix of the technical report [3]. We observed that for two out of the three benchmarks, the average running time of all solvers decreases using the axiomatized square root. Furthermore, Z3 is able to reason about more proof obligations with the axiomatized version. However, the success of this approach depends on the axioms added to KeY and may not always work if we do not have suitable axioms. For example, for the Circuit.instantCurrent benchmark (Listing 1.2), using the axiomatized square root, CVC4 is not able to validate the contract, but with fp.sqrt the contract is validated.

In summary, treating sqrt axiomatically can result in shorter solving times than performing bit-precise reasoning, but the approach may not always succeed when the axioms are not sufficient to prove a particular property.

#### 4.5 Discussion and insights

The experiments show that highly automated floating-point program verification is viable for relevant properties (handling of special values and some functional properties), up to a certain level of complexity (given by the SMT solvers). The choices of which parts of a proof obligation are delegated to SMT, and how they are translated to SMT, are crucial for achieving effective and efficient program verification. Arithmetic operations proved to be more efficiently dealt with by delegation to SMT, whereas for transcendental functions, axiomatization and rule-based treatment in the theorem prover, outside the SMT solver, performs clearly better.

#### 5 Related Work

Our implementation uses the floating-point SMT-LIB theory [17], which however does not handle transcendental functions, as their semantics is (library) implementation dependent. Some real-valued automated solvers do handle transcendental functions [5, 33], but to the best of our knowledge, the combination of floating-points and reals in SMT solvers is still severely limited.

None of the existing deductive verifiers support floating-point transcendental functions automatically. The Why3 deductive verification framework [30] has support for floating-point arithmetic, with front-ends for the C and Ada programming languages through Frama-C [25] and SPARK [19, 32], respectively. Why3 has back-end support for different SMT solvers, as well as interactive proof assistants like Coq. Until recently, Why3 would still discharge many interesting floating-point problems with the help of Coq, relying on significant user interaction. In later work [32] (in the context of floating-point verification for Ada programs), Why3 can achieve a higher degree of automation. Note, however, that the user is still required to add code assertions as well as 'ghost code' to a significant extent.

The Boogie intermediate verification language [47] also supports floating-point expressions, and targets Z3 for discharging proof obligations. In the Boogie community, it was observed that writing a specification in Boogie leads to decreases in SMT solver performance when compared to writing the goal in SMT-LIB directly, probably due to an inherent mixing of theories when using Boogie [2]. This matches our own experiences, and separation of theories should be considered an important task for the further development of floating-point verification.

Other deductive verifiers for Java have only rudimentary support for floating-points. VeriFast [41] treats floating-point operations as if they were real values, and OpenJML [23] parses programs with floating-point operations, but essentially treats float and double as uninterpreted sorts.

The Java category of verification competition SV-COMP [12] contains a number of benchmarks that make use of floating-point variables. However, the focus of these benchmarks is usually not on arithmetical properties of expressions, but on the completeness of the Java language support. Amongst the participants of SV-COMP 2020, the Symbolic (Java) Pathfinder (SPF) [55] (and various extensions) and the Java Bounded Model Checker (JBMC) [24] support floating-point arithmetic. Besides being limited to exploring the state space up to a bounded depth, their constraint languages do not support quantifiers and abstracting of method calls—which are features that we have used in this work.

Floating-point arithmetic has also been formalized in several interactive theorem provers [16, 31, 42]. While one can prove intricate properties about floating-point programs [14,15,38], proofs using interactive provers are to a large part manual and require significant expertise.

Abstract interpretation based techniques can show the absence of special values in floating-point code fully automatically, and several abstract domains which are sound with respect to floating-point arithmetic exist [20,43]. While the analysis itself is fully automated, applying it successfully to real-world programs in general requires adaptation to each program analyzed by end-users, e.g. the selection of suitable abstract domains or widening thresholds [13].

Besides showing the absence of special values, recent research has developed static analyses to bound floating-point roundoff errors [26, 35, 48, 52, 57]. These analyses currently work only for small arithmetic kernels and the tools in particular do not accept programs with objects.

Dynamic analyses generally scale well on real-world programs, but can only identify bugs (when given failure-triggering input), rather than proving correctness for *all* possible inputs. Executing a floating-point program together with a higher-precision one allows one to find inputs which cause large roundoff errors [11,21,44]. Ariadne [7] uses a combination of symbolic execution, real-valued SMT solving and testing to find inputs that trigger floating-point exceptions, including overflow and invalid operations. Our work subsumes this approach as the SMT solvers that we use can directly generate counterexamples, but more importantly, KeY is able to prove the absence of such exceptions.

#### 6 Conclusion

By joining the forces of rule-based deduction and SAT-based SMT solving, we presented the first working floating-point support in a deductive verification tool for Java and thereby closed a remaining gap in KeY, which now supports full sequential Java. Our evaluation shows that for specifications dealing with value ranges and absence of NaN and infinity, our approach can verify realistic programs within a reasonable time frame. We observe that the floating-point support of the MathSAT and CVC4 solvers scales sufficiently for our benchmarks, as long as the queries do not include any quantifiers, and that our axiomatized approach for handling transcendental functions is best realized using calculus rules in KeY's internal reasoning engine. While our work is implemented within the KeY verifier, we expect our approach to be portable to other verifiers.

#### Acknowledgements

This research was partially funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) project 387674182. The authors would like to thank Daniel Eddeland, who together with co-author W. Ahrendt performed prestudies which impacted the current work.

### References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

### **Helmholtz: A Verifier for Tezos Smart Contracts Based on Refinement Types**

Yuki Nishida<sup>1</sup>(✉), Hiromasa Saito<sup>1</sup>, Ran Chen<sup>1</sup>, Akira Kawata<sup>1</sup>, Jun Furuse<sup>2</sup>, Kohei Suenaga<sup>1</sup>, and Atsushi Igarashi<sup>1</sup>

<sup>1</sup> Kyoto University, Kyoto, Japan {nishida,hsaito,aran,akira,ksuenaga,igarashi}@fos.kuis.kyoto-u.ac.jp <sup>2</sup> DaiLambda, Inc., Kyoto, Japan jun.furuse@dailambda.jp

**Abstract.** A smart contract is a program executed on a blockchain, based on which many cryptocurrencies are implemented, and is being used for automating transactions. Due to the large amount of money that smart contracts deal with, there is a surging demand for a method that can statically and formally verify them.

This tool paper describes our type-based static verification tool Helmholtz for Michelson, which is a statically typed stack-based language for writing smart contracts that are executed on the blockchain platform Tezos. Helmholtz is designed on top of our extension of Michelson's type system with refinement types. Helmholtz takes a Michelson program annotated with a user-defined specification written in the form of a refinement type as input; it then typechecks the program against the specification based on the refinement type system, discharging the generated verification conditions with the SMT solver Z3. We briefly introduce our refinement type system for the core calculus Mini-Michelson of Michelson, which incorporates characteristic features such as compound datatypes (e.g., lists and pairs), higher-order functions, and invocation of another contract. Helmholtz successfully verifies several practical Michelson programs, including one that transfers money to an account and one that checks a digital signature.

#### **1 Introduction**

A blockchain is a data structure to implement a distributed ledger in a trustless yet secure way. The idea of blockchains was initially devised for the Bitcoin cryptocurrency [12] platform. Many cryptocurrencies are implemented using blockchains, in which value equivalent to a significant amount of money is exchanged.

Recently, many cryptocurrency platforms allow programs to be executed on a blockchain. Such programs are called smart contracts [19] (or simply contracts in this paper) since they work as a device to enable automated execution of a contract. In general, a smart contract is a program P<sup>a</sup> associated with an account a on a blockchain. When the account a receives money from another account b with a parameter v, the computation defined in P<sup>a</sup> is conducted, during which the state of the account a (e.g., the balance of the account and values that are stored by the previous invocations of P<sup>a</sup>) which is recorded on the blockchain may be updated. The contract P<sup>a</sup> may execute money transactions to another account (say c), which results in invocations of other contracts (say P<sup>c</sup>) during or after the computation; therefore, contract invocations may be chained.

<sup>-</sup>Current affiliation: Preferred Networks, Inc.

© The Author(s) 2021

J. F. Groote and K. G. Larsen (Eds.): TACAS 2021, LNCS 12652, pp. 262–280, 2021.

https://doi.org/10.1007/978-3-030-72013-1_14

Although smart contracts' original motivation was handling simple transactions (e.g., money transfer) among the accounts on a blockchain, recent contracts are being used for more complicated purposes (e.g., establishing a fund involving multiple accounts). Following this trend, the languages for writing smart contracts have also evolved from those that allow a contract to execute relatively simple transactions (e.g., Script for Bitcoin) to those that allow a program that is as complex as one written in standard programming languages (e.g., EVM for Ethereum and Michelson [1] for Tezos [4]).

Due to the large amount of money they deal with, verification of smart contracts is imperative. Static verification is especially needed since a smart contract cannot be fixed once deployed on a blockchain. Attacks on vulnerable contracts have indeed happened. For example, the DAO attack, in which the vulnerability of a fundraising contract was exploited, resulted in the loss of cryptocurrency equivalent to approximately 150M USD [18].

In this paper, we describe our type-based static verifier Helmholtz<sup>3</sup> for smart contracts written in Michelson. The Michelson language is a statically- and simply typed stack-based language equipped with rich data types (e.g., lists, maps, and higher-order functions) and primitives to manipulate them. Although several high-level languages that compile to Michelson are being developed, Michelson is most widely used to write a smart contract for Tezos as of writing.

A Michelson program expresses the above computation in a purely functional style, in which the Michelson program corresponding to P<sup>a</sup> is defined as a function. The function takes a pair of the parameter v and a value s that represents the current state of the account (called storage) and returns a pair of a list of operations and the updated storage s′. Here, an operation is a Michelson value that expresses the computation (e.g., transferring money to an account and invoking the contract associated with the account) that is to be conducted after the current computation (i.e., P<sup>a</sup>) terminates. After the computation specified by P<sup>a</sup> finishes with a pair of a storage value and an operation list, a blockchain system invokes the computation specified in the operation list. This purely functional style admits static verification methods for Michelson programs similar to those for standard functional languages.
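The shape of this function can be sketched in Java as follows. This is a toy model for illustration only, not Michelson and not part of Helmholtz; all type and member names are hypothetical.

```java
// Sketch: a contract as a pure function from (parameter, storage)
// to (operations, updated storage). All names are hypothetical.
final class ContractModel {
    // Stand-in for an operation: transfer `amount` to the account `target`.
    static final class Operation {
        final String target;
        final long amount;

        Operation(String target, long amount) {
            this.target = target;
            this.amount = amount;
        }
    }

    // Result of one invocation: operations to be run afterwards,
    // paired with the updated storage.
    static final class Result<S> {
        final java.util.List<Operation> operations;
        final S storage;

        Result(java.util.List<Operation> operations, S storage) {
            this.operations = operations;
            this.storage = storage;
        }
    }

    // The contract itself has no side effects; the blockchain system
    // interprets the returned operation list after the call.
    interface Contract<P, S> {
        Result<S> apply(P parameter, S storage);
    }

    // Example contract: adds the parameter to an integer storage
    // and requests no further operations.
    static final Contract<Long, Long> ADDER = (v, s) ->
        new Result<Long>(java.util.Collections.<Operation>emptyList(),
                         Long.valueOf(s + v));

    public static void main(String[] args) {
        Result<Long> r = ADDER.apply(2L, 40L);
        System.out.println(r.storage);           // 42
        System.out.println(r.operations.size()); // 0
    }
}
```

Because the contract only returns a description of what should happen next, its behavior is fully captured by the relation between its input pair and output pair, which is what a type-based specification can constrain.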

As the theoretical foundation of Helmholtz, we design a refinement type system for Michelson as an extension of the original simple type system. In contrast to standard refinement types that refine the types of values, our type system refines the type of stacks. We briefly describe our type system in Section 3; a detailed explanation is deferred to a future paper.

<sup>3</sup> Hermann von Helmholtz (1821–1894), a German physicist and physician, was a doctoral advisor of Albert A. Michelson (1852–1931), whom the Michelson language is apparently named after.

We show that our tool can verify several practical smart contracts. In addition to the contracts we wrote ourselves, we apply our tool to the sample Michelson programs used in Mi-Cho-Coq [3], a formalization of Michelson in the Coq proof assistant [21]. These include practical contracts such as one that checks a digital signature and one that transfers money.

We note that Helmholtz currently supports approximately 80% of the instructions of the Michelson language. Another limitation of the current Helmholtz is that it can verify only a single contract, although one often uses multiple contracts for an application, in which a contract may call another by a money transfer operation, and their behavior as a whole is of interest. We are currently extending Helmholtz so that it can deal with more programs.

Our contributions are summarized as follows: (1) Definition of the core calculus Mini-Michelson and its refinement type system; (2) Automated verification tool Helmholtz for Michelson contracts implemented based on the type system of Mini-Michelson; the interface to the implementation can be found at https://www.fos.kuis.kyoto-u.ac.jp/trylang/Helmholtz; and (3) Evaluation of Helmholtz with various Michelson contracts, including practical ones.

The rest of this paper is organized as follows. Before introducing the technical details, we present an overview of the verifier Helmholtz in Section 2 using a simple example of a Michelson contract. Section 3 introduces the core calculus Mini-Michelson and its refinement type system. Section 4 describes the verifier Helmholtz, a case study, and experimental results. After discussing related work in Section 5, we conclude in Section 6.

#### **2 Overview of Helmholtz and Michelson**

We overview our tool Helmholtz in this section before presenting its technical details. We also explain Michelson by example (Section 2.2) and user-written annotation added to a Michelson program for verification purposes (Section 2.3).

#### **2.1 Helmholtz**

As input, Helmholtz takes a Michelson program annotated with (1) its specification expressed in a refinement type and (2) additional user annotations such as loop invariants. It typechecks the annotated program against the specification using our refinement type system; the verification conditions generated during the typechecking are discharged by the SMT solver Z3 [11]. If the code successfully typechecks, then the program is guaranteed to satisfy the specification.

Helmholtz is implemented as a subcommand of tezos-client, the client program of the Tezos blockchain. For example, to verify boomerang.tz in Figure 1, we run tezos-client refinement boomerang.tz. If the verification succeeds, the command outputs VERIFIED to the terminal screen (with a few log messages); otherwise, it outputs UNVERIFIED.

**Fig. 1.** boomerang.tz. The comment inside /\* \*/ describes the stack at the program point.

#### **2.2 An Example Contract in Michelson**

Figure 1 shows an example of a Michelson program called boomerang. A Michelson program is associated with an account on the Tezos blockchain; the program is invoked by transferring money to this account. This artificial program in Figure 1, when it is invoked, is supposed to transfer the received money back to the account that initiated the transaction.

A Michelson program starts with type declarations of its parameter, whose value is given by contract invocation, and storage, which is the state that the contract account stores. Lines 1–2 declare that the types of both are unit, the type inhabited by the only value Unit. Lines 3–6 surrounded by << and >> are a user-written annotation used by Helmholtz for verification; we will explain this annotation later. The code section in Lines 8–24 is the body of this program.

Let us take a look at the code section of the program. In the following explanation of each instruction, we describe the state of the stack after each instruction as comments; stack elements are delimited by ▷.


A Michelson program is supposed to finish its execution with a singleton stack whose unique element is a pair of (1) a list of operations to be executed after the current execution of the contract finishes and (2) the new value for the storage.
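As a rough illustration of this calling convention, the following Python sketch models a contract as a pure function from a parameter and a storage value to a pair of an operation list and a new storage value. It follows the informal description of boomerang given above and is not a translation of Figure 1. The names `Transfer`, `Unit`, and `boomerang`, the address string, and the explicit `amount` and `source` arguments are assumptions made for this sketch; in Michelson, the amount and the source are obtained by instructions rather than passed as arguments.

```python
from typing import List, NamedTuple, Tuple

class Transfer(NamedTuple):
    """Hypothetical stand-in for a Michelson transfer operation."""
    argument: object   # argument passed to the receiving contract
    amount: int        # amount of money to send
    destination: str   # address of the receiving account

Unit = ()  # stand-in for Michelson's Unit value

def boomerang(param, storage, amount: int, source: str) -> Tuple[List[Transfer], object]:
    """Send the received money back to the account that initiated the transaction."""
    if amount == 0:
        ops = []                                # nothing to send back
    else:
        ops = [Transfer(Unit, amount, source)]  # one transfer back to the source
    # The result mirrors the final singleton stack: (operations, new storage).
    return ops, storage

ops, new_storage = boomerang(Unit, Unit, amount=5, source="tz1-alice")
print(ops, new_storage)
```

The operations in the returned list are only data at this point; in the setting described above, the blockchain system would carry them out after the contract's own computation has finished.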

Michelson is a statically typed language. Each instruction is associated with a typing rule that specifies the shapes of stacks before and after it by a sequence of simple types such as int and int list. For example, CONS requires the type of the top element to be T and that of the second to be T list (for any T); it ensures that the top element after it has type T list.
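The CONS rule just described can be sketched as a function on type stacks. This is a minimal illustration only: representing types as strings in the paper's `T list` notation and the function name `type_cons` are assumptions of this sketch, not part of Michelson or Helmholtz.

```python
from typing import List

def type_cons(stack: List[str]) -> List[str]:
    """Apply the simple typing rule for CONS to a type stack (top at index 0)."""
    if len(stack) < 2:
        raise TypeError("CONS needs at least two stack elements")
    head, tail, rest = stack[0], stack[1], stack[2:]
    # The second element must be a list whose element type is the top type.
    if tail != f"{head} list":
        raise TypeError(f"CONS expects '{head} list' below '{head}', found '{tail}'")
    # Both inputs are consumed and a single 'T list' is left on top.
    return [tail] + rest

print(type_cons(["int", "int list", "unit"]))
```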

Other notable features of Michelson include first-class functions, hashing, instructions related to cryptography such as signature verification, and manipulation of a blockchain using operations.

#### **2.3 Specification**

A user can specify the behavior of a program by a ContractAnnot annotation, which is a part of the augmented syntax of our verification tool. A ContractAnnot annotation gives a specification of a Michelson program by the following notation inspired by refinement types: {(param,st) | pre} -> {(ops,st') | post} & {exc | abpost}, where pre, post, and abpost are predicates. This specification reads as follows: if this program is invoked with a parameter param and storage st that satisfy the property pre, then (1) if the execution of this program succeeds, then it returns a list of operations ops and new storage st' that satisfy the property post; (2) if this program raises an exception with value exc, then exc satisfies abpost. The specification language is expressive enough to cover the specifications for practical contracts, including the ones we used in the experiments in Section 4.3. In the predicates, one can use several keywords such as amount for the amount of the money sent to this program when it is invoked and source for the source account's address.

The ContractAnnot annotation in Figure 1 (Lines 3–6) formalizes this program's specification as follows. This program can take any parameter and storage (Line 3). Successful execution of this program results in a pair (ops,st') that satisfies the condition in Lines 4–5 that expresses (1) if amount = 0, then ops is empty, that is, no operation will be issued; (2) if amount > 0, then ops is a list of a single element Transfer Unit amount (Contract source), which expresses transfer of money of the amount amount to the account at source with the unit argument.<sup>4</sup> In the specification language, source and amount are keywords that stand for the source account and the amount of money sent to this program, respectively. The part &{\_| False } expresses that this program does not raise an exception. This specification correctly formalizes the intended behavior of this program.
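To make the reading of this specification concrete, the sketch below writes the post-condition as an executable Python predicate and checks it against a small model of boomerang on finitely many inputs. This is only testing: Helmholtz instead proves the specification for all inputs. The tuple encoding of a transfer operation and the names `boomerang`, `post`, and `check_spec` are assumptions made for this sketch.

```python
from itertools import product

def boomerang(amount, source):
    """Model of boomerang: returns (ops, new storage); the storage is Unit, written ()."""
    ops = [] if amount == 0 else [("Transfer", (), amount, source)]
    return ops, ()

def post(ops, amount, source):
    """Post-condition on ops, following the description of the annotation in the text."""
    if amount == 0:
        return ops == []                                 # (1) no operation is issued
    return ops == [("Transfer", (), amount, source)]     # (2) a single transfer back

def check_spec(amounts, sources):
    """Test (not prove) the post-condition on finitely many inputs."""
    return all(post(boomerang(a, s)[0], a, s) for a, s in product(amounts, sources))

print(check_spec(range(0, 50), ["tz1-alice", "tz1-bob"]))
```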

#### **3 Refinement Type System for Mini-Michelson**

In this section, we formalize Mini-Michelson, a core subset of Michelson, with its syntax, operational semantics, and refinement type system. We also state that the type system is sound. We omit many features of the full language in favor of conciseness but include language constructs (such as higher-order functions and iterations) that make verification difficult.

Figure 2 shows the syntax of Mini-Michelson. Values, ranged over by V, consist of integers i; addresses a; operations transaction (V, i, a) to invoke a contract at a by sending money of amount i and an argument V; pairs (V<sub>1</sub>, V<sub>2</sub>) of values; the empty list [ ]; cons V<sub>1</sub> :: V<sub>2</sub>; and code IS of first-class functions.<sup>5</sup>

<sup>4</sup> As we mentioned in Section 1, Helmholtz can currently verify the behavior of a single contract, although there will be an invocation of the contract associated with source after the termination of boomerang. An operation is treated as an opaque data structure, from which one cannot extract values.

<sup>5</sup> Closures are not needed because functions in Michelson can access only arguments.

```
V ::= i | a | transaction (V, i, a) | (V1, V2) | [ ] | V1 :: V2 | IS
T ::= int | address | operation | T1 × T2 | T list | T1 → T2
IS ::= {I1; ... ; In}
 I ::= IS | DIP IS | DROP | DUP | SWAP | PUSH T V | NOT | ADD | IF IS1 IS2 |
       LOOP IS | PAIR | CAR | CDR | NIL T | CONS | IF_CONS IS1 IS2 | ITER IS |
       LAMBDA T1 T2 IS | EXEC | TRANSFER_TOKENS T
```
**Fig. 2.** Syntax of Mini-Michelson

Unlike Michelson, we use integers as a substitute for Boolean values so that 0 means false and the others mean true. Simple types, ranged over by T, consist of base types (int, address, and operation, which are self-explanatory), pair types T<sub>1</sub> × T<sub>2</sub>, list types T list, and function types T<sub>1</sub> → T<sub>2</sub>. Instruction sequences, ranged over by IS, are sequences of instructions, ranged over by I, enclosed by curly braces. A Mini-Michelson program is an instruction sequence.

Instructions include those for stack manipulation (to DROP, DUPlicate, SWAP, and PUSH values); NOT and ADD for manipulating integers; PAIR, CAR, and CDR for pairs; NIL and CONS for constructing lists; and TRANSFER\_TOKENS to create an operation that expresses a money transfer after the current contract execution. The instruction IF branches depending on whether the stack top is 0 or not; IF\_CONS branches on whether the stack top is a cons or not. The instruction LOOP IS repeats IS as long as the stack top is a nonzero integer at the loop entry; ITER IS is for iterating the list at the stack top. LAMBDA pushes a function (described by its operand IS) onto the stack, and EXEC calls a function. Perhaps unfamiliar is DIP IS, which pops and saves the stack top somewhere else, executes IS, and then pushes the saved value back.

We also use a few kinds of stacks in the following definitions: value stacks, ranged over by S; type stacks, ranged over by T̄; and type binding stacks, ranged over by Υ, of the form x<sub>1</sub> : T<sub>1</sub> ▷ … ▷ x<sub>n</sub> : T<sub>n</sub>. The empty stack is denoted by ‡, and push by ▷. We often omit the empty stack and write, for example, V<sub>1</sub> ▷ V<sub>2</sub> for V<sub>1</sub> ▷ V<sub>2</sub> ▷ ‡. Intuitively, T<sub>1</sub> ▷ … ▷ T<sub>n</sub> and x<sub>1</sub> : T<sub>1</sub> ▷ … ▷ x<sub>n</sub> : T<sub>n</sub> describe stacks V<sub>1</sub> ▷ … ▷ V<sub>n</sub> where each value V<sub>i</sub> is of type T<sub>i</sub>. We will use variables to name stack elements in the refinement type system.

Mini-Michelson (as well as Michelson) is equipped with a simple type system. The type judgment for instructions is written T̄ ⊢ I ⇒ T̄′, which means that instruction I transforms a stack of type T̄ into another stack of type T̄′. The type judgment for values is written V : T, which means that V is given simple type T. We omit typing rules as they are fairly straightforward.

#### **3.1 Operational Semantics**

We give a big-step operational semantics of Mini-Michelson by defining the judgment S ⊢ I ⇓ S′, which means that executing the instruction I under the stack S results in the stack S′ (and also S ⊢ IS ⇓ S′ for instruction sequences). Most rules for S ⊢ I ⇓ S′ are straightforward. We show rules for DIP and LOOP below and omit other rules.

$$\frac{S \vdash IS \Downarrow S'}{V \rhd S \vdash \mathsf{DIP}\ IS \Downarrow V \rhd S'} \qquad \frac{S \vdash IS \Downarrow S' \qquad S' \vdash \mathsf{LOOP}\ IS \Downarrow S'' \qquad (i \neq 0)}{i \rhd S \vdash \mathsf{LOOP}\ IS \Downarrow S''} \qquad \frac{}{0 \rhd S \vdash \mathsf{LOOP}\ IS \Downarrow S}$$

The first rule means that the body IS is executed with the stack S obtained by removing the top element V, which is pushed back onto the resulting stack S′. There are two rules for LOOP: the first rule means that if the stack top is nonzero, then the body is executed, and then the execution of LOOP IS is repeated; the second rule means that, if the stack top is zero, then the loop acts as a no-op.
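The rules for DIP and LOOP can be read as a small interpreter. The sketch below is a minimal Python illustration for a handful of Mini-Michelson instructions, assuming a representation chosen only for this example: stacks are Python lists with the top at index 0, and instructions are strings or tuples such as `("PUSH", "int", 3)`, `("DIP", [...])`, and `("LOOP", [...])`. The names `run`, `run_seq`, and `triangular` are hypothetical, and the recursive LOOP rule is unrolled into a `while` loop.

```python
from typing import List, Tuple, Union

Instr = Union[str, Tuple]

def run_seq(instrs: List[Instr], stack: list) -> list:
    """Execute an instruction sequence from left to right."""
    for instr in instrs:
        stack = run(instr, stack)
    return stack

def run(instr: Instr, stack: list) -> list:
    """Execute one instruction and return the resulting stack."""
    op = instr if isinstance(instr, str) else instr[0]
    if op == "PUSH":                      # ("PUSH", type, value)
        return [instr[2]] + stack
    if op == "DROP":
        return stack[1:]
    if op == "DUP":
        return [stack[0]] + stack
    if op == "SWAP":
        return [stack[1], stack[0]] + stack[2:]
    if op == "ADD":
        return [stack[0] + stack[1]] + stack[2:]
    if op == "DIP":
        # Run the body under the stack without its top, then push the top back.
        return [stack[0]] + run_seq(instr[1], stack[1:])
    if op == "LOOP":
        # Pop the top; while it is nonzero, run the body and pop again.
        top, rest = stack[0], stack[1:]
        while top != 0:
            rest = run_seq(instr[1], rest)
            top, rest = rest[0], rest[1:]
        return rest
    raise ValueError(f"unsupported instruction: {op}")

# Sums n + (n-1) + ... + 1 for an initial stack [n, 0] with n >= 0.
triangular = [
    "DUP",
    ("LOOP", ["DUP", ("DIP", ["ADD"]), ("PUSH", "int", -1), "ADD", "DUP"]),
    "DROP",
]
print(run_seq(triangular, [4, 0]))
```

The example program is written for this sketch and is not the triangular\_num.tz contract used in the experiments. It keeps a counter above an accumulator and duplicates the counter so that LOOP can consume one copy as its condition.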

#### **3.2 Refinement Type System**

In the refinement type system, a simple stack type T<sub>1</sub> ▷ … ▷ T<sub>n</sub> is augmented with a formula ϕ of first-order logic to describe the relationship among stack elements. We introduce refinement stack types, ranged over by Φ, of the form {x<sub>1</sub> : T<sub>1</sub> ▷ … ▷ x<sub>n</sub> : T<sub>n</sub> | ϕ(x<sub>1</sub>, …, x<sub>n</sub>)}, which denotes stacks V<sub>1</sub> ▷ … ▷ V<sub>n</sub> such that V<sub>1</sub> : T<sub>1</sub>, …, V<sub>n</sub> : T<sub>n</sub> and ϕ(V<sub>1</sub>, …, V<sub>n</sub>) hold.

We show (part of) the syntax of terms and formulae of the first-order logic:

$$\begin{array}{rcl} t & ::= & x \mid V \mid \mathsf{transaction}\,(t_1, t_2, t_3) \mid t_1 :: t_2 \mid (t_1, t_2) \mid t_1 + t_2 \mid \cdots \\ \varphi & ::= & t_1 = t_2 \mid \mathsf{call}\,(t_1, t_2) = t_3 \mid \varphi_1 \lor \varphi_2 \mid \neg \varphi \mid \exists\, x : T. \varphi \mid \cdots \end{array}$$

The language for predicates is multi-sorted, where a sort is a simple type of Michelson. The sorting rules for term constructors and relation symbols are standard. For example, in t<sub>1</sub> + t<sub>2</sub>, both t<sub>1</sub> and t<sub>2</sub> have to be of sort int; in t<sub>1</sub> = t<sub>2</sub>, the sorts of t<sub>1</sub> and t<sub>2</sub> must be the same; and so on. The only relation symbol worth explaining is call (t<sub>1</sub>, t<sub>2</sub>) = t<sub>3</sub>, which informally means that calling function t<sub>1</sub> with argument t<sub>2</sub> (as the only element of the input stack) yields a stack consisting only of t<sub>3</sub> as a result. We use other predicates, connectives, and quantifiers such as t<sub>1</sub> ≠ t<sub>2</sub>, ϕ<sub>1</sub> ∧ ϕ<sub>2</sub>, ϕ<sub>1</sub> =⇒ ϕ<sub>2</sub>, and ∀ x : T.ϕ, which can be considered as derived forms.

We define the semantics of the formulae in a standard manner. Let σ be a value assignment, i.e., a sort-respecting finite map from variables to values. We define the interpretation [[t]]<sub>σ</sub> of t under σ and valid formulae under a value assignment, denoted by σ |= ϕ; for call (t<sub>1</sub>, t<sub>2</sub>) = t<sub>3</sub>, we define σ |= call (t<sub>1</sub>, t<sub>2</sub>) = t<sub>3</sub> iff [[t<sub>2</sub>]]<sub>σ</sub> ▷ ‡ ⊢ [[t<sub>1</sub>]]<sub>σ</sub> ⇓ [[t<sub>3</sub>]]<sub>σ</sub> ▷ ‡. Equality on instruction sequences is intensional: a formula IS<sub>1</sub> = IS<sub>2</sub> is valid only if IS<sub>1</sub> and IS<sub>2</sub> are syntactically equal.

For a finite mapping Γ (called a type environment) from variables to sorts, Γ |= σ and Γ |= ϕ are defined as usual: Γ |= σ iff dom (σ) = dom (Γ) and σ(x) : Γ(x) for any x ∈ dom (σ); Γ |= ϕ iff σ |= ϕ for any value assignment σ such that Γ |= σ.

The type system is equipped with subtyping whose judgment is of the form Γ ⊢ Φ<sub>1</sub> <: Φ<sub>2</sub>, which means stack type Φ<sub>1</sub> is a subtype of Φ<sub>2</sub> under Γ. The type judgment for instructions (resp. instruction sequences) is of the form Γ ⊢ Φ<sub>1</sub> I Φ<sub>2</sub> (resp. Γ ⊢ Φ<sub>1</sub> IS Φ<sub>2</sub>), which means that, under Γ, if I (resp. IS) is executed under a stack satisfying Φ<sub>1</sub>, the resulting stack (if the execution terminates) satisfies Φ<sub>2</sub>. We often call Φ<sub>1</sub> the pre-condition and Φ<sub>2</sub> the post-condition.

We show representative typing rules in Figure 3.

$$\frac{\Gamma, x : T \vdash \{\Upsilon \mid \varphi\}\ IS\ \{\Upsilon' \mid \varphi'\}}{\Gamma \vdash \{x : T \rhd \Upsilon \mid \varphi\}\ \mathsf{DIP}\ IS\ \{x : T \rhd \Upsilon' \mid \varphi'\}} \tag{RT-Dip}$$

$$\frac{\Gamma \vdash \{\Upsilon \mid \exists\, x : \mathsf{int}. \varphi \land x \neq 0\}\ IS_1\ \Phi \qquad \Gamma \vdash \{\Upsilon \mid \exists\, x : \mathsf{int}. \varphi \land x = 0\}\ IS_2\ \Phi}{\Gamma \vdash \{x : \mathsf{int} \rhd \Upsilon \mid \varphi\}\ \mathsf{IF}\ IS_1\ IS_2\ \Phi} \tag{RT-If}$$

$$\frac{\Gamma \vdash \{\Upsilon \mid \exists\, x : \mathsf{int}. \varphi \land x \neq 0\}\ IS\ \{x : \mathsf{int} \rhd \Upsilon \mid \varphi\}}{\Gamma \vdash \{x : \mathsf{int} \rhd \Upsilon \mid \varphi\}\ \mathsf{LOOP}\ IS\ \{\Upsilon \mid \exists\, x : \mathsf{int}. \varphi \land x = 0\}} \tag{RT-Loop}$$

$$\frac{y_1' : T_1 \vdash \{y_1 : T_1 \mid y_1' = y_1 \land \varphi_1\}\ IS\ \{y_2 : T_2 \mid \varphi_2\}}{\Gamma \vdash \{\Upsilon \mid \varphi\}\ \mathsf{LAMBDA}\ T_1\ T_2\ IS\ \{x : T_1 \to T_2 \rhd \Upsilon \mid \varphi \land \forall\, y_1' : T_1, y_2 : T_2.\ \varphi_1[y_1 := y_1'] \land \mathsf{call}\,(x, y_1') = y_2 \implies \varphi_2\}} \tag{RT-Lambda}$$

$$\frac{}{\Gamma \vdash \{x_1 : T_1 \rhd x_2 : T_1 \to T_2 \rhd \Upsilon \mid \varphi\}\ \mathsf{EXEC}\ \{x_3 : T_2 \rhd \Upsilon \mid \exists\, x_1 : T_1, x_2 : T_1 \to T_2.\ \varphi \land \mathsf{call}\,(x_2, x_1) = x_3\}} \tag{RT-Exec}$$

$$\frac{\Gamma \vdash \Phi_1 <: \Phi_1' \qquad \Gamma \vdash \Phi_1'\ I\ \Phi_2' \qquad \Gamma \vdash \Phi_2' <: \Phi_2}{\Gamma \vdash \Phi_1\ I\ \Phi_2} \tag{RT-Sub}$$

#### **Fig. 3.** Typing rules (excerpt)


<sup>6</sup> The scope of a variable in a refinement stack type is its predicate part and so y<sub>1</sub> cannot appear in the post-condition.

the pre- and post-conditions, respectively, of function x<sub>2</sub>. If x<sub>1</sub> satisfies ϕ<sub>1</sub>, then we can derive that ϕ<sub>2</sub> holds.

**–** (RT-Sub) is the rule for subsumption, which strengthens the pre-condition and weakens the post-condition. In our type system, subtyping is defined semantically: A subtyping judgment Γ ⊢ {Υ | ϕ<sub>1</sub>} <: {Υ | ϕ<sub>2</sub>} holds if for any σ such that ∀x ∈ dom (Γ, Υ). σ(x) : (Γ, Υ)(x), σ |= ϕ<sub>1</sub> =⇒ ϕ<sub>2</sub> is valid. (Here, by abuse of notation, the type binding stack Υ is regarded as a mapping from variables to sorts.)

We state that our type system is sound: For a well-typed instruction, if we execute the instruction under a stack that satisfies the pre-condition of the typing, then (if the execution halts) the resulting stack satisfies the post-condition of the typing. To state the soundness theorem, we define an auxiliary relation Γ |= S : Φ, which means "stack S satisfies refinement stack type Φ under environment Γ", by: Γ |= V<sub>1</sub> ▷ … ▷ V<sub>m</sub> : {y<sub>1</sub> : T<sub>1</sub> ▷ … ▷ y<sub>m</sub> : T<sub>m</sub> | ϕ} ⇐⇒ V<sub>1</sub> : T<sub>1</sub>, …, V<sub>m</sub> : T<sub>m</sub> and σ[y<sub>1</sub> ↦ V<sub>1</sub>, …, y<sub>m</sub> ↦ V<sub>m</sub>] |= ϕ for any σ such that Γ |= σ.

Then, the soundness theorem, whose proof will appear in a forthcoming full version, is stated as follows:

**Theorem 1 (Soundness).** If Γ ⊢ Φ<sub>1</sub> IS Φ<sub>2</sub>, Γ |= S : Φ<sub>1</sub>, and S ⊢ IS ⇓ S′, then Γ |= S′ : Φ<sub>2</sub>.

**Sketch of Typechecking** We implement a typechecking algorithm as follows. Given a type environment, a pre-condition, and a post-condition, our algorithm computes the strongest post-condition of the code starting from the given pre-condition. This computation is conducted according to a syntax-directed version of the typing rules, created essentially in the same way as for a type system with subtyping (e.g., one described in [15]). An application of subtyping generates verification conditions. The accumulated verification conditions are fed to Z3; the typechecking succeeds if they are successfully discharged.
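To give a feel for forward propagation of a pre-condition, the following Python sketch handles straight-line code over four instructions. It is an illustration under strong simplifying assumptions and not the algorithm implemented in Helmholtz: formulae are kept as lists of strings, no subtyping or verification conditions are generated, no solver is called, and fresh names `v0`, `v1`, … are assumed not to clash with names in the input. The function name `strongest_post` and the instruction encoding are hypothetical.

```python
from itertools import count

def strongest_post(instrs, stack, facts):
    """Propagate a pre-condition forward through straight-line code.

    stack: variable names for the stack elements (top at index 0)
    facts: conjuncts of the pre-condition, as strings
    Variables that leave the stack remain in `facts`; they play the role of
    the existentially quantified variables in the typing rules.
    """
    fresh = (f"v{i}" for i in count())
    stack, facts = list(stack), list(facts)
    for instr in instrs:
        op = instr if isinstance(instr, str) else instr[0]
        if op == "PUSH":                  # ("PUSH", type, value)
            x = next(fresh)
            facts.append(f"{x} = {instr[2]}")
            stack = [x] + stack
        elif op == "DUP":
            x = next(fresh)
            facts.append(f"{x} = {stack[0]}")
            stack = [x] + stack
        elif op == "SWAP":
            stack = [stack[1], stack[0]] + stack[2:]
        elif op == "ADD":
            x = next(fresh)
            facts.append(f"{x} = {stack[0]} + {stack[1]}")
            stack = [x] + stack[2:]
        else:
            raise ValueError(f"unsupported instruction: {op}")
    return stack, facts

stack, facts = strongest_post([("PUSH", "int", 1), "ADD"], ["n"], ["n > 0"])
print(stack, " and ".join(facts))
```

In a full checker, the resulting conjunction would be compared against the user-supplied post-condition, and that implication would be the verification condition passed to the solver.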

#### **3.3 Extensions**

The implementation supports a few extensions of the formalization explained above, which are explained below.

The type system implemented in Helmholtz is extended with refinements for values thrown by raising exceptions. For example, the typing rule for instruction FAILWITH, which raises an exception with the value at the stack top, is given as follows:

Γ ⊢ {x : T ▷ Υ | ϕ} FAILWITH {Υ | ⊥} & {err | ∃ x : T, Υ. ϕ ∧ x = err}.

The rule expresses that, if FAILWITH is executed under a non-empty stack that satisfies ϕ, then the program point just after the instruction is not reachable (hence, {Υ | ⊥}). The refinement ∃ x : T, Υ. ϕ ∧ x = err for the exception case states that ϕ in the pre-condition holds and that the top element x is equal to the raised value err; since x is not in scope in the exception refinement, x is bound by an existential quantifier. The typing rules for the other instructions can be extended with the "&" part easily.

Helmholtz deals with measure functions introduced by Kawaguchi et al. [9] and supported by Liquid Haskell [23]. If a measure function is defined by a Measure annotation, Helmholtz "weaves" the function definition into relevant typing rules. For instance, given the annotation Measure len : list int -> int where [] = 0 | h :: t = (1 + len t), Helmholtz assumes an uninterpreted function symbol len and augments (RT-Nil) and (RT-Cons) as follows, where the last equality in each post-condition comes from the definition of len.

$$\Gamma \vdash \{\Upsilon \mid \varphi\}\ \mathsf{NIL}\ T\ \{x : T\ \mathsf{list} \rhd \Upsilon \mid \varphi \land x = [\,] \land \mathsf{len}\ [\,] = 0\}$$

$$\Gamma \vdash \{x_1 : T \rhd x_2 : T\ \mathsf{list} \rhd \Upsilon \mid \varphi\}\ \mathsf{CONS}\ \{x_3 : T\ \mathsf{list} \rhd \Upsilon \mid \exists\, x_1 : T, x_2 : T\ \mathsf{list}.\ \varphi \land x_1 :: x_2 = x_3 \land \mathsf{len}\ (x_1 :: x_2) = 1 + \mathsf{len}\ x_2\}$$
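The two equalities added to the post-conditions are exactly the defining clauses of the measure. The Python sketch below writes the `len` measure as a recursive function and states the two added facts as checks. Note the difference in spirit: as described above, Helmholtz treats len as an uninterpreted function symbol and only supplies these equalities to the solver, whereas this sketch evaluates the function directly. The names `measure_len`, `nil_fact`, and `cons_fact` are hypothetical.

```python
def measure_len(xs):
    """Measure len : list int -> int where [] = 0 | h :: t = (1 + len t)."""
    if not xs:
        return 0
    _h, *t = xs          # destruct the list as h :: t
    return 1 + measure_len(t)

def nil_fact():
    """Fact added to the post-condition of NIL: len [] = 0."""
    return measure_len([]) == 0

def cons_fact(x1, x2):
    """Fact added to the post-condition of CONS: len (x1 :: x2) = 1 + len x2."""
    return measure_len([x1] + x2) == 1 + measure_len(x2)

print(measure_len([3, 1, 4]), nil_fact(), cons_fact(7, [3, 1, 4]))
```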

#### **4 Tool Implementation**

In this section, we discuss annotations in detail, show a case study of contract verification, and present verification experiments.

#### **4.1 Annotations**

Helmholtz supports several forms of annotations (surrounded by << and >> in the source code), other than ContractAnnot explained in Section 2.

Assert Φ and Assume Φ can appear before or after an instruction. The former asserts that the stack at the annotated program location satisfies the type Φ; the assertion is verified by Helmholtz. If there is an annotation Assume Φ, Helmholtz assumes that the stack satisfies the type Φ at the annotated program location. A user can give a hint to Helmholtz by using Assume Φ. The user has to make sure that it is correct; if an Assume annotation is incorrect, the verification result may be incorrect.

LoopInv Φ asserts the loop invariant of a loop instruction (e.g., LOOP and ITER). In the current implementation, annotating a loop invariant using LoopInv Φ is mandatory. Helmholtz checks that Φ is indeed a loop invariant and uses it to verify the rest of the program.

In the current implementation, a LAMBDA instruction, which pushes a function on the top of the stack, must be accompanied by a LambdaAnnot annotation, in which Φ<sub>pre</sub> → Φ<sub>post</sub> & Φ<sub>abpost</sub> is a specification of the pushed function and the bindings (x<sub>1</sub> : T<sub>1</sub>, …, x<sub>n</sub> : T<sub>n</sub>) introduce the ghost variables that can be used in the annotations in the body of the annotated LAMBDA instruction;<sup>7</sup> one can omit the declaration of ghost variables if it is empty. The first contract in Figure 4, which pushes a function that takes a pair of integers and returns the sum of them, presents an example of LambdaAnnot. The annotated type of the function (Line 5) expresses that it returns 4 if it is fed with a pair (3, 1). The ghost variables a and b are used in the annotations Assume (Line 8) and Assert (Line 10) in the body to denote the first and the second components of the pair passed to this function.

<sup>7</sup> ContractAnnot also allows declarations of ghost variables used in the code section.

**Fig. 4.** lambda.tz, which uses higher-order functions, and length.tz, which uses a measure function in the contract annotation.

Helmholtz allows user-defined (recursive) functions to be used in annotations; these functions are called measure functions following the terminology of Liquid Haskell [9]. The annotation Measure x : T<sub>1</sub> → T<sub>2</sub> where p<sub>1</sub> = e<sub>1</sub> | ··· | p<sub>n</sub> = e<sub>n</sub> defines a recursive function x that takes a value of type T<sub>1</sub>, destructs it by pattern matching, and returns a value of type T<sub>2</sub>. Metavariables p and e represent ML-like patterns and expressions. The second contract in Figure 4, which computes the length of the list passed as a parameter, exemplifies the usage of the Measure annotation. This contract defines a measure function len that takes a list of integers and returns its length; it is used in ContractAnnot and LoopInv.

#### **4.2 Case Study: Contract with Signature Verification**

Figure 5 presents the code of the contract checksig.tz, which verifies that a sender indeed signed certain data using her private key. This contract uses instruction CHECK\_SIGNATURE, which is supposed to be executed under a stack of the form key ▷ sig ▷ bytes ▷ tl, where key is a public key, sig is a signature, and bytes is some data. CHECK\_SIGNATURE pops these three values from the stack and pushes true if sig is the valid signature for bytes with the private key corresponding to key.

**Fig. 5.** checksig.tz, which involves signature verification.

The intended behavior of checksig.tz is as follows. It stores a pair of an address addr, which is the address of a contract that takes a string parameter, and a public key key in its storage. It takes a pair (sig,s) of type pair signature string as a parameter where signature is the primitive Michelson type for signatures. This contract terminates without exception if sig is created from the serialized (packed) representation of s and signed by the private key corresponding to key. In a normal termination, this contract transfers 1 mutez to the contract with address addr. If this signature verification fails, then an exception is raised.

This behavior is expressed as a specification in the ContractAnnot annotation in checksig.tz as follows.


(Contract store.first) ], which represents an operation of transferring 1 mutez to the contract Contract store.first with the parameter param.second. The predicate sig and the constructor Pack are primitives of Helmholtz that can be used in an annotation.

**–** The refinement in the exception part expresses that if an exception is raised, then the signature verification should have failed (not (sig store.second param.first (Pack param.second))).

Helmholtz successfully verifies checksig.tz without any additional annotation in the code section. If we change the instruction ASSERT in Line 12 to DROP to let the contract drop the result of the signature verification (hence, an exception is not raised even if the signature verification fails), the verification fails as intended.

#### **4.3 Experiments**

We applied Helmholtz to various contracts; Table 1 shows an excerpt of the results, in which we report (1) the number of instructions in each contract (column #instr.) and (2) the time (ms) spent to verify each contract. The experiments were conducted on macOS Catalina 10.15.7 with a Dual-Core Intel Core i5 (1.8 GHz) and 8 GB RAM. We used Z3 version 4.8.8. The contracts boomerang.tz, deposit.tz, manager.tz, vote.tz, and reservoir.tz are taken from the benchmark of Mi-Cho-Coq [3]. checksig.tz is derived from weather\_insurance.tz of the official Tezos test suite.<sup>8</sup> vote\_for\_delegate.tz and xcat.tz are taken from the official test suite; xcat.tz is simplified from the original. triangular\_num.tz is a simple test case that we made as an example of using LOOP. The source code of these contracts can be found at the Web interface of Helmholtz. Each contract is supposed to work as follows.


<sup>8</sup> https://gitlab.com/tezos/tezos/-/tree/ee2f75bb941522acbcf6d5065a9f3b2/tests_python/contracts/mini_scenarios


In the experiments, we verified that each contract indeed works according to the intention explained above. triangular\_num.tz was the only contract that required a manual annotation for verification in the code section; we needed to specify a loop invariant in this contract.


**Table 1.** Benchmark result

Although the numbers of instructions in these contracts are not large, they capture essential features of smart contracts; every contract except triangular\_num.tz executes transactions; deposit.tz and manager.tz check the identity of the caller; and checksig.tz conducts signature verification. The time spent on verification is small.

#### **5 Related Work**

There are several publications on the formalization of programming languages for writing smart contracts. Hirai [7] formalizes EVM, a low-level smart contract language of Ethereum and its implementation, using Lem [13], a language to specify semantic definitions; definitions written in Lem can be compiled into definitions in Coq, HOL4, and Isabelle/HOL. Based on the generated definition, he verifies several properties of Ethereum smart contracts using Isabelle/HOL. Bernardo et al. [3] implemented Mi-Cho-Coq, a formalization of the semantics of Michelson using the Coq proof assistant. They also verified several Michelson contracts. Compared to their approach, we aim to develop an automated verification tool for smart contracts. Park et al. [14] developed a formal verification tool for EVM by using the K-framework [17], which can be used to derive a symbolic model checker from a formally specified language semantics (in this case, formalized EVM semantics [6]), and successfully applied the derived model checker to a few EVM contracts. It would be interesting to formalize the semantics of Michelson in the K-framework to compare Helmholtz with the derived model checker.

The DAO attack [18], mentioned in Section 1, is one of the notorious attacks on a smart contract. It exploits a vulnerability of a smart contract that is related to a callback. Grossman et al. [5] proposed a type-based technique to verify that execution of a smart contract that may contain callbacks is equivalent to another execution without any callback. This property, called effectively callback freedom, can be seen as one of the criteria for execution of a smart contract not to be vulnerable to the DAO-like attack. Their type system focuses on verifying the ECF property of execution of a smart contract, whereas ours concerns the verification of generic functional properties of a smart contract.

Benton proposes a program logic for a minimal stack-based programming language [2]. His program logic can give an assertion to a stack as our refinement stack types do. However, his language supports neither first-class functions nor instructions for dealing with smart contracts (e.g., signature verification).

Our type system is an extension of the Michelson type system with refinement types, which have been successfully applied to various programming languages [16,22,9,10,20,26,23,24,25]. DTAL [25] is a notable example of an application of refinement types to an assembly language, a low-level language like Michelson. A DTAL program defines a computation using registers; we are not aware of refinement types for stack-based languages like Michelson.

We notice the resemblance between our type system and a program logic for PCF proposed by Honda and Yoshida [8], although the targets of verification are different. Their logic supports a judgment of the form A ⊢ e :<sub>u</sub> B, where e is a PCF program, A is a pre-condition assertion, B is a post-condition assertion, and u represents the value that e evaluates to and can be used in B, which resembles our type judgment in the formalization in Section 3. Their assertion language also incorporates a term expression f • x, which expresses the value resulting from the application of f to x; this expression resembles the formula call (t<sub>1</sub>, t<sub>2</sub>) = t<sub>3</sub> used in a refinement predicate. We are not aware of an automated verifier implemented based on their logic. Further comparison is interesting future work.

#### **6 Conclusion**

We described our automated verification tool Helmholtz for the smart contract language Michelson based on the refinement type system for Mini-Michelson. Helmholtz verifies whether a Michelson program follows a specification given in the form of a refinement type. We also demonstrated that Helmholtz successfully verifies various practical Michelson contracts.

Currently, Helmholtz supports approximately 80% of the instructions of the Michelson language. The definition of a measure function is limited in the sense that, for example, it can define only a function with one argument. We are currently extending Helmholtz so that it can deal with more programs.

Helmholtz currently verifies the behavior of a single contract, although a blockchain application often consists of multiple contracts in which contract calls are chained. To verify such an application as a whole, we plan to extend Helmholtz so that it can verify an inter-contract behavior compositionally by combining the verification results of each contract.

### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

#### **SyReNN: A Tool for Analyzing Deep Neural Networks**

Matthew Sotoudeh and Aditya V. Thakur

University of California, Davis CA 95616, USA {masotoudeh,avthakur}@ucdavis.edu

**Abstract.** Deep Neural Networks (DNNs) are rapidly gaining popularity in a variety of important domains. Formally, DNNs are complicated vector-valued functions which come in a variety of sizes and applications. Unfortunately, modern DNNs have been shown to be vulnerable to a variety of attacks and buggy behavior. This has motivated recent work in formally analyzing the properties of such DNNs. This paper introduces SyReNN, a tool for understanding and analyzing a DNN by computing its symbolic representation. The key insight is to decompose the DNN into linear functions. Our tool is designed for analyses using low-dimensional subsets of the input space, a unique design point in the space of DNN analysis tools. We describe the tool and the underlying theory, then evaluate its use and performance on three case studies: computing Integrated Gradients, visualizing a DNN's decision boundaries, and patching a DNN.

**Keywords:** Deep Neural Networks · Symbolic representation · Integrated Gradients

#### **1 Introduction**

Deep Neural Networks (DNNs) [18] have become the state-of-the-art in a variety of applications including image recognition [53,33] and natural language processing [12]. Moreover, they are increasingly used in safety- and security-critical applications such as autonomous vehicles [31] and medical diagnosis [10,38,28,37]. These advances have been accelerated by improved hardware and algorithms.

DNNs (Section 2) are programs that compute a vector-valued function, i.e., from R<sup>n</sup> to R<sup>m</sup>. They are straight-line programs written as a concatenation of alternating linear and non-linear layers. The coefficients of the linear layers are learned from data via gradient descent during a training process. A number of different non-linear layers (called activation functions) are commonly used, including the rectified linear and maximum pooling functions.

Owing to the variety of application domains and deployment constraints, DNNs come in many different sizes. For instance, large image-recognition and natural-language processing models are trained and deployed using cloud resources [33,12], medium-size models could be trained in the cloud but deployed on hardware with limited resources [31], and finally small models could be trained and deployed directly on edge devices [47,9,22,34,35]. There has also been a recent push to compress trained models to reduce their size [24]. Such smaller models play an especially important role in privacy-critical applications, such as wake word detection for voice assistants, because they allow sensitive user data to stay on the user's own device instead of needing to be sent to a remote computer for processing.

Artifact available at https://zenodo.org/record/4124489. Extended paper available at https://arxiv.org/abs/2101.03263.

© The Author(s) 2021

J. F. Groote and K. G. Larsen (Eds.): TACAS 2021, LNCS 12652, pp. 281–302, 2021. https://doi.org/10.1007/978-3-030-72013-1_15

Although DNNs are very popular, they are not perfect. One particularly concerning development is that modern DNNs have been shown to be extremely vulnerable to adversarial examples, inputs which are intentionally manipulated to appear unmodified to humans but become misclassified by the DNN [54,19,40,8]. Similarly, fooling examples are inputs that look like random noise to humans, but are classified with high confidence by DNNs [41]. Mistakes made by DNNs have led to loss of life [36,17] and wrongful arrests [26,27]. For this reason, it is important to develop techniques for analyzing, understanding, and repairing DNNs.

This paper introduces SyReNN, a tool for understanding and analyzing DNNs. SyReNN implements state-of-the-art algorithms for computing precise symbolic representations of piecewise-linear DNNs (Section 3). Given an input subspace of a DNN, SyReNN computes a symbolic representation that decomposes the behavior of the DNN into finitely-many linear functions. SyReNN implements the one-dimensional analysis algorithm of Sotoudeh and Thakur [50] and extends it to the two-dimensional setting as described in Section 4.

*Key insights.* There are two key insights enabling this approach, first identified in Sotoudeh and Thakur [50]. First, most popular DNN architectures today are piecewise-linear, meaning they can be precisely decomposed into finitely-many linear functions. This allows us to reduce their analysis to equivalent questions in linear algebra, one of the most well-understood fields of modern mathematics. Second, many applications only require analyzing the behavior of the DNN on a low-dimensional subset of the input space. Hence, whereas prior work has attempted to give up precision for efficiency in analyzing high-dimensional input regions [48,49,16], our work has focused on algorithms that are both efficient and precise in analyzing lower-dimensional regions (Section 4).

*Tool design.* The SyReNN tool is designed to be easy to use and extend, as well as efficient (Section 5). The core of SyReNN is written as a highly-optimized, parallel C++ server using Intel TBB for parallelization [45] and Eigen for matrix operations [23]. A user-friendly Python front-end interfaces with the PyTorch deep learning framework [44].

*Use cases.* We demonstrate the utility of SyReNN using three applications. The first computes Integrated Gradients (IG), a state-of-the-art measure used to determine which input dimensions (e.g., pixels for an image-recognition network) were most important in the final classification produced by the network (Section 6.1). The second precisely visualizes the decision boundaries of a DNN (Section 6.2). The last patches (repairs) a DNN to satisfy some desired specification involving infinitely-many points (Section 6.3). Thus, SyReNN is an interesting and useful tool in the toolbox for understanding and analyzing DNNs.

*Contributions.* The contributions of this paper are:

- A definition of the symbolic representation of DNNs (Section 3).
- An efficient algorithm for computing the symbolic representation of a DNN over low-dimensional input subspaces (Section 4).
- The design and implementation of the SyReNN tool (Section 5).
- Three applications of SyReNN (Section 6).


Section 2 presents preliminaries about DNNs; Section 7 presents related work; Section 8 concludes. SyReNN is available on GitHub at https://github.com/95616ARG/SyReNN.

#### **2 Preliminaries**

We now formally define the notion of DNN we will use in this paper.

**Definition 1.** A Deep Neural Network (DNN) is a function $f : \mathbb{R}^n \to \mathbb{R}^m$ which can be written $f = f_1 \circ f_2 \circ \cdots \circ f_n$ for a sequence of layer functions $f_1, f_2, \ldots, f_n$.

Our work is primarily concerned with the popular class of piecewise-linear DNNs, defined below. In this definition and the rest of this paper, we will use the term "polytope" to mean a convex and bounded polytope except where specified.

**Definition 2.** A function $f : \mathbb{R}^n \to \mathbb{R}^m$ is piecewise-linear (PWL) if its input domain $\mathbb{R}^n$ can be partitioned into finitely-many possibly-unbounded polytopes $X_1, X_2, \ldots, X_k$ such that $f_{\restriction X_i}$ is linear for every $X_i$.

The most common activation function used today is the ReLU function, a PWL activation function which is defined below.

**Definition 3.** The rectified linear function (ReLU) is a function $\mathrm{ReLU} : \mathbb{R}^n \to \mathbb{R}^n$ defined component-wise by

$$\mathrm{ReLU}(\vec{v})_i := \begin{cases} 0 & \text{if } v_i < 0\\ v_i & \text{otherwise}, \end{cases}$$

where $\mathrm{ReLU}(\vec{v})_i$ is the $i$th component of the vector $\mathrm{ReLU}(\vec{v})$ and $v_i$ is the $i$th component of the vector $\vec{v}$.

In order to see that ReLU is PWL, we must show that its input domain R<sup>n</sup> can be partitioned such that, in each partition, ReLU is linear. In this case, we can use the orthants of R<sup>n</sup> as our partitioning: within each orthant, the signs of the components do not change hence ReLU is the linear function that just zeros out the negative components.
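
The orthant argument can be checked numerically. The following is a minimal sketch in plain Python; the helper names `orthant_mask` and `apply_mask` are illustrative assumptions and are not part of SyReNN. It shows that two points in the same orthant are mapped by ReLU exactly as by one fixed linear map, namely multiplication by a 0/1 diagonal matrix.

```python
def relu(v):
    """Component-wise ReLU (Definition 3): negative entries become 0."""
    return [0.0 if x < 0 else x for x in v]


def orthant_mask(v):
    """0/1 vector identifying the orthant of v: 1 where v_i >= 0, else 0."""
    return [0.0 if x < 0 else 1.0 for x in v]


def apply_mask(mask, v):
    """The linear map v -> diag(mask) v."""
    return [m * x for m, x in zip(mask, v)]


# Two points of R^3 in the same orthant (signs +, -, +).
a = [2.0, -1.0, 0.5]
b = [0.1, -3.0, 4.0]
mask = orthant_mask(a)

# Within that orthant, ReLU coincides with one fixed linear map.
same_orthant = mask == orthant_mask(b)
agrees_on_a = relu(a) == apply_mask(mask, a)
agrees_on_b = relu(b) == apply_mask(mask, b)
print(same_orthant, agrees_on_a, agrees_on_b)
```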

Fig. 1: Example function for which $\widehat{f_{\restriction [-1,2]}} = \{[-1, 0], [0, 1], [1, 2]\}$.

Although we focus on ReLU due to its popularity and expository power, SyReNN works with a number of other popular PWL layers, including MaxPool, Leaky ReLU, Hard Tanh, Fully-Connected, and Convolutional layers, as defined in [18]. PWL layers have become exceedingly common. In fact, nearly all of the state-of-the-art image recognition models bundled with PyTorch [43] are PWL.

Example 1. The DNN $f : \mathbb{R}^1 \to \mathbb{R}^1$ defined by

$$f(x) := \begin{bmatrix} 1 & -1 & -1 \end{bmatrix} \mathrm{ReLU}\left( \begin{bmatrix} 1 & -1 \\ 1 & 0 \\ -1 & 0 \end{bmatrix} \begin{bmatrix} x \\ 1 \end{bmatrix} \right)$$

can be broken into layers $f = f_1 \circ f_2 \circ f_3$ where

$$f_1(x) := \begin{bmatrix} 1 & -1 \\ 1 & 0 \\ -1 & 0 \end{bmatrix} \begin{bmatrix} x \\ 1 \end{bmatrix}, \quad f_2 = \mathrm{ReLU}, \quad \text{and} \quad f_3(\vec{v}) = \begin{bmatrix} 1 & -1 & -1 \end{bmatrix} \vec{v}.$$

The DNN's input-output behavior on the domain [−1, 2] is shown in Figure 1.
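
As a sanity check, the layers of Example 1 can be evaluated directly. The sketch below is plain Python and independent of SyReNN; the function `piecewise` is a hand-derived closed form (an assumption of this sketch, obtained by checking which ReLU units are active on each interval) that should agree with the network on $[-1, 2]$.

```python
def relu(v):
    """Component-wise ReLU."""
    return [0.0 if x < 0 else x for x in v]


def f1(x):
    """First (affine) layer: the 3x2 matrix applied to the vector [x, 1]."""
    return [x - 1.0, x, -x]


def f3(v):
    """Last (linear) layer: the row vector [1, -1, -1] applied to v."""
    return v[0] - v[1] - v[2]


def f(x):
    """The full DNN of Example 1, evaluated layer by layer."""
    return f3(relu(f1(x)))


def piecewise(x):
    """Hand-derived closed form on [-1, 2], one linear piece per region."""
    if x <= 0.0:
        return x        # only the unit ReLU(-x) is active
    if x <= 1.0:
        return -x       # only the unit ReLU(x) is active
    return -1.0         # the units ReLU(x - 1) and ReLU(x) are active


samples = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
values = [f(x) for x in samples]
print(values)
```

The three branches of `piecewise` correspond to the three intervals $[-1, 0]$, $[0, 1]$, and $[1, 2]$ named in the caption of Figure 1.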

#### **3 A Symbolic Representation of DNNs**

We formalize the symbolic representation according to the following definition:

**Definition 4.** Given a PWL function $f : \mathbb{R}^n \to \mathbb{R}^m$ and a bounded convex polytope $X \subseteq \mathbb{R}^n$, we define the symbolic representation of $f$ on $X$, written $\widehat{f_{\restriction X}}$, to be a finite set of polytopes $\widehat{f_{\restriction X}} = \{P_1, \ldots, P_n\}$, such that:

1. The set $\{P_1, \ldots, P_n\}$ partitions $X$, except possibly for overlapping boundaries.
2. Each $P_i$ is a bounded convex polytope.
3. Within each $P_i$, the function $f_{\restriction P_i}$ is equivalent to a linear function $F_i$.


Notably, if $f$ is a DNN using only PWL layers, then $f$ is PWL and so we can define $\widehat{f_{\restriction X}}$. This symbolic representation allows one to reduce questions about the DNN $f$ to questions about finitely-many linear functions $F_i$. For example, because a linear function maps the convex hull of a set of points to the convex hull of their images, to verify that $\forall x \in X.\ f(x) \in Y$ for some (convex) polytope $Y$, it suffices to verify $\forall P_i \in \widehat{f_{\restriction X}}.\ \forall v \in \mathrm{Vert}(P_i).\ f(v) \in Y$, where $\mathrm{Vert}(P_i)$ is the (finite) set of vertices for the bounded convex polytope $P_i$; thus, here both of the quantifiers are over finite sets. The symbolic representation described above can be seen as a generalization of the ExactLine representation [50], which considered only one-dimensional restriction domains of interest.
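
The vertex-based check can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not SyReNN's API: the helper `verify_on_vertices` is hypothetical, the network is the one from Example 1, and the partition is the one given in the caption of Figure 1, with each one-dimensional polytope written as its pair of endpoints.

```python
def relu(v):
    """Component-wise ReLU."""
    return [0.0 if x < 0 else x for x in v]


def f(x):
    """The DNN of Example 1."""
    h = relu([x - 1.0, x, -x])
    return h[0] - h[1] - h[2]


def verify_on_vertices(func, partition, in_Y):
    """Check func(v) in Y for every vertex v of every polytope in the partition.

    This decides the property on all of X when Y is convex and func is
    linear on each polytope of the partition.
    """
    return all(in_Y(func(v)) for polytope in partition for v in polytope)


# Symbolic representation of f on X = [-1, 2]; each 1-D polytope is a
# pair of endpoints, which are exactly its vertices.
partition = [(-1.0, 0.0), (0.0, 1.0), (1.0, 2.0)]

holds = verify_on_vertices(f, partition, lambda y: -1.0 <= y <= 0.0)
fails = verify_on_vertices(f, partition, lambda y: -0.5 <= y <= 0.0)
print(holds, fails)
```

Only four distinct points are evaluated, yet the first call establishes the property for every point of $[-1, 2]$, while the second finds a violating vertex at $x = -1$.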

Example 2. Consider again the DNN $f : \mathbb{R}^1 \to \mathbb{R}^1$ given by

$$f(x) := \begin{bmatrix} 1 & -1 & -1 \end{bmatrix} \mathrm{ReLU}\left( \begin{bmatrix} 1 & -1 \\ 1 & 0 \\ -1 & 0 \end{bmatrix} \begin{bmatrix} x \\ 1 \end{bmatrix} \right)$$

and the region of interest $X = [-1, 2]$. The input-output behavior of $f$ on $X$ is shown in Figure 1. From this, we can see that

$$\widehat{f_{\restriction X}} = \{ [-1,0], [0,1], [1,2] \}.$$

Within each of these partitions, the input-output behavior is linear, which for $\mathbb{R}^1 \to \mathbb{R}^1$ we can see visually as just a line segment. As this set fully partitions $X$, then, this is a valid $\widehat{f_{\restriction X}}$.

#### **4 Computing the Symbolic Representation**

This section presents an efficient algorithm for computing $\widehat{f_{\restriction X}}$ for a DNN $f$ composed of PWL layers. To retain both scalability and precision, we will require the input region $X$ be two-dimensional. This design choice is relatively unexplored in the neural-network analysis literature (most analyses strike a balance between precision and scalability, ignoring dimensionality). We show that, for two-dimensional $X$, we can use an efficient polytope representation to produce an algorithm that demonstrates good best-case and in-practice efficiency while retaining full precision. This algorithm represents a direct generalization of the approach of [50].

The difficulties our algorithm addresses arise from three areas. First, when computing $\widehat{f_{\restriction X}}$ there may be exponentially many such partitions on all of $\mathbb{R}^n$ but only a small number of them may intersect with $X$. Consequently, the algorithm needs to be able to find those partitions that intersect with $X$ efficiently without explicitly listing all of the partitions on $\mathbb{R}^n$. Second, it is often more convenient to specify the partitioning via hyperplanes separating the partitions than explicit polytopes. For example, for the one-dimensional ReLU function we may simply state that the line $x = 0$ separates the two partitions, because ReLU is linear both in the region $x \leq 0$ and $x \geq 0$. Finally, neural networks are typically composed of sequences of linear and piecewise-linear layers, where the partitioning imposed by each layer individually may be well-understood but their composition is more complex. For example, identifying the linear partitions of $y = \mathrm{ReLU}(4 \cdot \mathrm{ReLU}(-3x - 1) + 2)$ is non-trivial, even though we know the linear partitions of each composed function individually.
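
For this particular composition the partitions can still be worked out by hand, which makes it a useful test case. The inner ReLU switches where $-3x - 1 = 0$, i.e., at $x = -1/3$. The argument of the outer ReLU is $4 \cdot \mathrm{ReLU}(-3x - 1) + 2 \geq 2 > 0$, so the outer ReLU is always active and contributes no further breakpoint. The sketch below (plain Python; `y_piecewise` is a hand-derived closed form, an assumption of this sketch) compares the composition against the resulting two linear pieces.

```python
def relu(x):
    """Scalar ReLU."""
    return x if x > 0 else 0.0


def y(x):
    """The composed function ReLU(4 * ReLU(-3x - 1) + 2)."""
    return relu(4.0 * relu(-3.0 * x - 1.0) + 2.0)


def y_piecewise(x):
    """Closed form with a single breakpoint at x = -1/3."""
    return -12.0 * x - 2.0 if x <= -1.0 / 3.0 else 2.0


samples = [-2.0, -1.0, -0.5, -1.0 / 3.0, 0.0, 1.0]
max_gap = max(abs(y(x) - y_piecewise(x)) for x in samples)
print(max_gap)
```

Note that the breakpoint of the outer layer disappears in the composition: knowing each layer's partitions individually does not directly give the partitions of the composed function.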

Our algorithm only requires the user to specify the hyperplanes defining the partitioning for the activation function used in each layer; our current implementation comes with support for common PWL activation functions. For example, if a ReLU layer is used for an $n$-dimensional input vector, then the hyperplanes would be defined by the equations $x_1 = 0, x_2 = 0, \ldots, x_n = 0$. It then computes the symbolic representation for a single layer at a time, composing them sequentially to compute the symbolic representation across the entire network.

To allow such compositions of layers, instead of directly computing $\widehat{f_{\restriction X}}$, we will define another primitive, denoted by the operator $\otimes$ and sometimes referred to as Extend, such that

$$\mathrm{Extend}(h, \widehat{g}) = h \otimes \widehat{g} = \widehat{h \circ g}. \tag{1}$$

Consider $f = f_n \circ f_{n-1} \circ \cdots \circ f_1$, and let $I : x \mapsto x$ be the identity map. $I$ is linear across its entire input space, and, thus, $\widehat{I_{\restriction X}} = \{X\}$. By the definition of $\mathrm{Extend}(f_1, \cdot)$, we have $f_1 \otimes \widehat{I_{\restriction X}} = \widehat{(f_1 \circ I)_{\restriction X}} = \widehat{(f_1)_{\restriction X}}$, where the final equality holds by the definition of the identity map $I$. We can then iteratively apply this procedure to inductively compute $\widehat{(f_i \circ \cdots \circ f_1)_{\restriction X}}$ from $\widehat{(f_{i-1} \circ \cdots \circ f_1)_{\restriction X}}$ like so:

$$f_i \otimes \widehat{(f_{i-1} \circ \cdots \circ f_1)_{\restriction X}} = \widehat{(f_i \circ f_{i-1} \circ \cdots \circ f_1)_{\restriction X}}$$

until we have computed $\widehat{(f_n \circ f_{n-1} \circ \cdots \circ f_1)_{\restriction X}} = \widehat{f_{\restriction X}}$, which is the required symbolic representation.
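
The layer-by-layer fold can be illustrated in the simpler one-dimensional setting. The following is a minimal sketch, not SyReNN's implementation: the names `extend_relu` and `symbolic_representation` are hypothetical, $X$ is restricted to an interval, and only ReLU is supported as a PWL layer. It starts from the partition $\{X\}$ of the identity map and refines it once per ReLU layer, using the fact that the network prefix is affine on every segment of the current partition.

```python
def relu(v):
    """Component-wise ReLU."""
    return [0.0 if x < 0 else x for x in v]


def extend_relu(g, segments):
    """1-D analogue of Extend for a ReLU layer.

    g is the network prefix feeding the ReLU and must be affine on every
    segment (a, b); each segment is cut where a component of g crosses 0.
    """
    result = []
    for a, b in segments:
        cuts = {a, b}
        for ya, yb in zip(g(a), g(b)):
            if ya * yb < 0:  # strict sign change inside (a, b)
                cuts.add(a + (b - a) * ya / (ya - yb))
        points = sorted(cuts)
        result.extend(zip(points, points[1:]))
    return result


def symbolic_representation(layers, X):
    """Fold Extend over the layers, starting from the partition {X}."""
    segments = [X]                 # representation of the identity map
    prefix = lambda x: [x]         # composition of the layers seen so far
    for kind, layer in layers:
        if kind == "relu":         # only PWL layers refine the partition
            segments = extend_relu(prefix, segments)
        prefix = (lambda p, l: lambda x: l(p(x)))(prefix, layer)
    return segments


# Example 1: affine layer, ReLU, then a linear output layer.
layers = [
    ("affine", lambda v: [v[0] - 1.0, v[0], -v[0]]),
    ("relu", relu),
    ("affine", lambda v: [v[0] - v[1] - v[2]]),
]
partition = symbolic_representation(layers, (-1.0, 2.0))
print(partition)
```

On the network of Example 1 this recovers the three intervals of Example 2. Affine layers leave the partition unchanged, which is why only the ReLU layer triggers a refinement.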

#### **4.1 Algorithm for Extend**

Algorithm 1 presents an algorithm for computing Extend for arbitrary PWL functions, where $\mathrm{Extend}(h, \widehat{g}) = h \otimes \widehat{g} = \widehat{h \circ g}$.

*Geometric intuition for the algorithm.* Consider the ReLU function (Definition 3). It can be shown that, within any orthant (i.e., when the signs of all coefficients are held constant), $\mathrm{ReLU}(x)$ is equivalent to some linear function, in particular the element-wise product of $x$ with a vector that zeroes out the negative-signed components. However, for our algorithm, all we need to know is that the linear partitions of ReLU (in this case the orthants) are separated by hyperplanes $x_1 = 0, x_2 = 0, \ldots, x_n = 0$.

Given a two-dimensional convex bounded polytope $X$, the execution of the algorithm for $f = \mathrm{ReLU}$ can be visualized as follows. We pick some vertex $v$ of $X$, and begin traversing the boundary of the polytope in counter-clockwise order. If we hit an orthant boundary (corresponding to some hyperplane $x_i = 0$), it implies that the function behaves differently at the points of the polytope to one side of the boundary from those at the other side of the boundary. Thus, we partition $X$ into $X_1$ and $X_2$, where $X_1$ lies to one side of the hyperplane and $X_2$ lies to the other side. We recursively apply this procedure to $X_1$ and $X_2$ until the resulting polytopes all lie on exactly one side of every hyperplane (orthant boundary). But lying on exactly one side of every hyperplane (orthant boundary) implies each polytope lies entirely within a linear partition of the function (a single orthant), hence the application of the function on that polytope is linear, and hence we have our partitioning.

*Functions used in algorithm.* Given a two-dimensional bounded convex polytope X, Vert(X) returns a list of its vertices in counter-clockwise order, repeating the initial vertex at the end. Given a set of points X, ConvexHull(X) represents their convex hull (the smallest bounded polytope containing every point in X). Given a scalar value x, Sign(x) computes the sign of that value (i.e., −1 if x < 0, +1 if x > 0, and 0 if x = 0).

*Algorithm description.* The key insight of the algorithm is to recursively partition the polytopes until such a partition lies entirely within a linear region of the function $f$. Algorithm 1 begins by constructing a queue containing the polytopes of $\widehat{g_{\restriction X}}$. Each iteration either removes a polytope from the queue that lies entirely in one linear region (placing it in $Y$), or splits (partitions) some polytope into two smaller polytopes that get put back into the queue. When we pop a polytope $P$ from the queue, Line 6 iterates over all hyperplanes $N_k \cdot x = b_k$ defining the piecewise-linear partitioning of $f$, looking for any for which some vertex $V_i$ lies on the positive side of the hyperplane and another vertex $V_j$ lies on the negative side of the hyperplane. If none exist (Line 7), by convexity we are guaranteed that the entire polytope lies entirely on one side with respect to every hyperplane, meaning it lies entirely within a linear partition of $f$. Thus, we can add it to $Y$ and continue. If two such vertices are found (starting Line 10), then we can find "extreme" $i$ and $j$ indices such that $V_i$ is the last vertex in a counter-clockwise traversal to lie on the same side of the hyperplane as $V_1$ and $V_j$ is the last vertex lying on the opposite side of the hyperplane. We then call SplitPlane() (Algorithm 2) to actually partition the polytope on opposite sides of the hyperplane, adding both to our worklist.
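
The worklist loop can be sketched in plain Python for two-dimensional polygons. This is a simplified illustration rather than a transcription of Algorithm 1 or Algorithm 2: the helper `split_polygon` is a hypothetical stand-in for SplitPlane() that walks every edge instead of using the extreme indices $i$ and $j$, the first separating hyperplane is chosen rather than an arbitrary one, and exact rational arithmetic (`fractions.Fraction`) is assumed so that newly created vertices lie exactly on the splitting hyperplane.

```python
from fractions import Fraction


def split_polygon(V, d):
    """Split a convex CCW polygon V along the zero set of an affine function.

    d[k] is the function's value at vertex V[k]. Returns the part where the
    function is >= 0 and the part where it is <= 0, both in CCW order.
    """
    pos, neg = [], []
    for k in range(len(V)):
        p, q = V[k], V[(k + 1) % len(V)]
        dp, dq = d[k], d[(k + 1) % len(V)]
        if dp >= 0:
            pos.append(p)
        if dp <= 0:
            neg.append(p)
        if dp * dq < 0:  # edge p -> q crosses the hyperplane
            t = dp / (dp - dq)
            c = tuple(pi + t * (qi - pi) for pi, qi in zip(p, q))
            pos.append(c)
            neg.append(c)
    return pos, neg


def extend(hyperplanes, g, polygons):
    """Worklist loop in the spirit of Algorithm 1.

    hyperplanes: list of (N, b) pairs; g: map that is affine on each polygon.
    """
    work, done = list(polygons), []
    while work:
        V = work.pop()
        for N, b in hyperplanes:
            d = [sum(n * y for n, y in zip(N, g(v))) - b for v in V]
            if any(x > 0 for x in d) and any(x < 0 for x in d):
                work.extend(split_polygon(V, d))
                break
        else:  # no hyperplane separates two vertices: V is in one region
            done.append(V)
    return done


def area(V):
    """Shoelace formula for a CCW polygon."""
    n = len(V)
    return sum(V[k][0] * V[(k + 1) % n][1] - V[(k + 1) % n][0] * V[k][1]
               for k in range(n)) / 2


# ReLU on R^2: hyperplanes x_1 = 0 and x_2 = 0; g is the identity.
relu_planes = [((1, 0), 0), ((0, 1), 0)]
square = [tuple(map(Fraction, p)) for p in [(-1, -1), (1, -1), (1, 1), (-1, 1)]]
pieces = extend(relu_planes, lambda v: v, [square])
print(len(pieces), [area(P) for P in pieces])
```

Applied to the square $[-1, 1]^2$, the loop performs three splits and returns one unit square per quadrant. With floating-point coordinates, a tolerance on the sign test would be needed to avoid re-splitting along a hyperplane that a vertex only approximately lies on.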

In the best case, each partition is in a single orthant: the algorithm never calls SplitPlane() at all — it merely iterates over all of the n input partitions, checks their v vertices, and appends to the resulting set (for a best-case complexity of O(nv)). In the worst case, it splits each polytope in the queue on each face, resulting in exponential time complexity. As we will show in Section 6, this exponential worst-case behavior is not encountered in practice, thus making SyReNN a practical tool for DNN analysis.

Please see the extended version of this paper for a worked example of the algorithm's execution.

#### **4.2 Representing Polytopes**

We close this section with a discussion of implementation concerns when representing the convex polytopes that make up the partitioning of $\widehat{f_{\restriction X}}$. In standard computational geometry, bounded polytopes can be represented in two equivalent forms:

1. The half-space or H-representation, which encodes the polytope as an intersection of finitely-many half-spaces.
2. The vertex or V-representation, which encodes the polytope as a finite set of points whose convex hull is the polytope.

**Algorithm 1:** $f \otimes \widehat{g_{\restriction X}}$ for two-dimensional $X$. $f$ is defined by hyperplanes $N_1 \cdot x = b_1$ through $N_m \cdot x = b_m$ such that, within any partition imposed by the hyperplanes, $f$ is equivalent to some affine function.

**Input:** $\widehat{g_{\restriction X}} = \{P_1, \ldots, P_n\}$.

**Output:** $\widehat{(f \circ g)_{\restriction X}}$

$$\begin{array}{rl}
1 & W \leftarrow \mathrm{ConstructQueue}(\widehat{g_{\restriction X}}) \\
2 & Y \leftarrow \emptyset \\
3 & \textbf{while } W \text{ not empty } \textbf{do} \\
4 & \quad P \leftarrow \mathrm{Pop}(W) \\
5 & \quad V \leftarrow \mathrm{Vert}(P) \\
6 & \quad K \leftarrow \{N_k \mid \exists i, j : \mathrm{Sign}(N_k \cdot g(V_i) - b_k) > 0 \wedge \mathrm{Sign}(N_k \cdot g(V_j) - b_k) < 0\} \\
7 & \quad \textbf{if } K = \emptyset \textbf{ then} \\
8 & \quad\quad Y \leftarrow Y \cup \{P\} \\
9 & \quad\quad \textbf{continue} \\
10 & \quad N, b \leftarrow \text{any element from } K \\
11 & \quad i \leftarrow \arg\max_i \{\mathrm{Sign}(N \cdot g(V_i) - b) = \mathrm{Sign}(N \cdot g(V_1) - b)\} \\
12 & \quad j \leftarrow \arg\max_j \{\mathrm{Sign}(N \cdot g(V_j) - b) \neq \mathrm{Sign}(N \cdot g(V_i) - b)\} \\
13 & \quad \textbf{for } V' \in \mathrm{SplitPlane}(V, g, i, j, N, b) \textbf{ do} \\
14 & \quad\quad W \leftarrow \mathrm{Push}(W, \mathrm{ConvexHull}(V')) \\
15 & \textbf{return } Y
\end{array}$$


Certain operations are more efficient when using one representation compared to the other. For example, finding the intersection of two polytopes in an H-representation can be done in linear time by concatenating their representative half-spaces, but the same is not possible in V-representation.

There are two main operations on polytopes we need to perform in our algorithms: (i) splitting a polytope with a hyperplane, and (ii) applying an affine map to all points in the polytope. In general, the first is more efficient in an H-representation, while the latter is more efficient in a V-representation. However, when restricted to two-dimensional polygons, the former is also efficient in a V-representation, as demonstrated by Algorithm 2, helping to motivate our use of the V-representation in our algorithm.
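
Operation (ii) is straightforward in a V-representation because the image of a polytope under an affine map is the convex hull of the images of its vertices. A minimal sketch in plain Python follows; the helper name `affine_image` and the example matrix are illustrative assumptions.

```python
def affine_image(A, b, vertices):
    """Apply x -> A x + b to every vertex of a V-representation."""
    return [
        tuple(sum(a * x for a, x in zip(row, v)) + bi for row, bi in zip(A, b))
        for v in vertices
    ]


# A triangle in R^2 mapped into R^3 by a 3x2 matrix and an offset.
triangle = [(0, 0), (2, 0), (0, 1)]
A = [[1, 0], [0, 1], [1, 1]]
b = [0, 0, -1]
image = affine_image(A, b, triangle)
print(image)
```

The cost is one matrix-vector product per vertex, and the number of vertices does not grow even when the map embeds the polygon in a much higher-dimensional space.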

Furthermore, the two polytope representations have different resiliency to floating-point operations. In particular, H-representations for polytopes in $\mathbb{R}^n$ are notoriously difficult to achieve high precision with, because the error introduced from using floating-point numbers gets arbitrarily large as one goes in a particular direction along any hyperplane face. Ideally, we would like the hyperplane to be most accurate in the region of the polytope itself, which corresponds to choosing the magnitude of the normal vector correctly. Unfortunately, to our knowledge, there is no efficient algorithm for computing the ideal floating-point H-representation of a polytope, although libraries such as APRON [30] are able to provide reasonable results for low-dimensional spaces. However, because neural networks utilize extremely high-dimensional spaces (often hundreds or thousands of dimensions) and we wish to iteratively apply our analysis, we find that errors from using floating-point H-representations can quickly multiply and compound to become infeasible. By contrast, floating-point inaccuracies in a V-representation are directly interpretable as slightly misplacing the vertices of the polytope; no "localization" process is necessary to penalize inaccuracies close to the polytope more than those far away from it.

Another difference is in the space complexity of the representation. In general, H-representations can be more space-efficient for common shapes than V-representations. However, when the polytope lies in a low-dimensional subspace of a larger space, the V-representation is usually significantly more efficient.

Thus, V-representations are a good choice for low-dimensionality polytopes embedded in high-dimensional space, which is exactly what we need for analyzing neural networks with two-dimensional restriction domains of interest. This is why we designed our algorithms to rely on Vert(X), so that they could be directly computed on a V-representation.

The 2D algorithm described above can be seen as implementing the recursive case of a more general, n-dimensional version of the algorithm that recurses on each of the (n − 1)-dimensional facets. Please see the extended version of this paper for more details.

#### **5 SyReNN tool**

This section provides more details about the design and implementation of our tool, SyReNN (Symbolic Representations of Neural Networks), which computes $\widehat{f_{\restriction X}}$, where $f$ is a DNN using only piecewise-linear layers and $X$ is a union of one- or two-dimensional polytopes. The tool is available under the MIT license at https://github.com/95616ARG/SyReNN and in the PyPI package pysyrenn.

*Input and output format.* SyReNN supports reading DNNs from two standard formats: ERAN (a textual format used by the ERAN project [1]) as well as ONNX (an industry-standard format supporting a wide variety of different models) [42]. Internally, the input DNN is described as an instance of the Network class, which is itself a list of sequential Layers. A number of layer types are provided by SyReNN, including FullyConnectedLayer, ConvolutionalLayer, and ReLULayer. To support more complicated DNN architectures, we have implemented a ConcatLayer, which represents a concatenation of the output of two different layers. The input region of interest, $X$, is defined as a polytope described by a list of its vertices in counter-clockwise order. The output of the tool is the symbolic representation $\widehat{f_{\restriction X}}$.

*Overall Architecture.* We designed SyReNN in a client-server architecture using gRPC [20] and protocol buffers [21] as a standard method of communication between the two. This architecture allows the bulk of the heavy computation to be done in efficient C++ code, while allowing user-friendly interfaces in a variety of languages. It also allows practitioners to run the server remotely on a more powerful machine if necessary. The C++ server implementation uses the Intel TBB library for parallelization. Our official front-end library is written in Python, and available as a package on PyPI so installation is as simple as pip install pysyrenn. The entire project can be built using the Bazel build system, which manages dependencies using checksums.

*Server Architecture.* The major algorithms are implemented as a gRPC server written in C++. When a connection is first made, the server initializes the state with an empty DNN $f(x) = x$. During the session, three operations are permitted: (i) append a layer $g$ so that the current session's DNN is updated from $f_0$ to $f_1(x) := g(f_0(x))$, (ii) compute $\widehat{f_{\restriction X}}$ for a one-dimensional $X$, or (iii) compute $\widehat{f_{\restriction X}}$ for a two-dimensional $X$. We have separate methods for one- and two-dimensional $X$, because the one-dimensional case has specific optimizations for controlling memory usage. The SegmentedLine and UPolytope types are used to represent one- and two-dimensional partitions of $X$, respectively. When operation (i) is performed, a new instance of the LayerTransformer class is initialized with the relevant parameters and added to a running vector of the current layers. When operation (ii) is performed, a new queue of SegmentedLines is constructed, corresponding to $X$, and the previously allocated LayerTransformers are applied sequentially to compute $\widehat{f_{\restriction X}}$. In this case, extra control is provided to automatically gauge memory usage and pause computation for portions of $X$ until more memory is made available. Finally, when operation (iii) is performed, a new instance of UPolytope is initialized with the vertices of $X$ and the LayerTransformers are again applied sequentially to compute $\widehat{f_{\restriction X}}$.

*Client Architecture.* Our Python client exposes an interface for defining DNNs similar to the popular Sequential-Network Keras API [11]. Objects represent individual layers in the network, and they can be combined sequentially into a Network instance. The key addition of our library is that this Network exposes methods for computing $\widehat{f_{\restriction X}}$ given a V-representation description of $X$. To do this, it invokes the server and passes a layer-by-layer description of $f$ followed by the polytope $X$, then parses the response $\widehat{f_{\restriction X}}$.

*Extending to support different layer types.* Different layer types are supported by sub-classing the LayerTransformer class. Instances of this class expose a method for computing $\mathrm{Extend}(h, \cdot)$ for the corresponding layer $h$. To simplify implementation, two sub-classes of LayerTransformer are provided: one for entirely-linear layers (such as fully-connected and convolutional layers), and one for piecewise-linear layers. For fully-linear layers, all that needs to be provided is a method computing the layer function itself. For piecewise-linear layers, two methods need to be provided: (1) one computing the layer function itself, and (2) one describing the hyperplanes which separate the linear regions. The base class then directly implements Algorithm 1 for that layer. This architecture makes supporting new layers a straightforward process.

*Float Safety.* Like Reluplex [32], SyReNN uses floating-point arithmetic to compute $\widehat{f_{\restriction X}}$ efficiently. Unfortunately, this means that in some cases its results will not be entirely precise when compared to a real-valued or multiple-precision version of the algorithm. Approaches for addressing this are discussed in the extended version of this paper.

#### **6 Applications of SyReNN**

This section presents the use of SyReNN in three example case studies.

#### **6.1 Integrated Gradients**

A common problem in the field of explainable machine learning is understanding why a DNN made the prediction it did. For example, given an image classified by a DNN as a 'cat,' why did the DNN decide it was a cat instead of, say, a dog? Were there particular pixels which were particularly important in deciding this? Integrated Gradients (IG) [52] is the state-of-the-art method for computing such model attributions.

**Definition 5.** Given a DNN $f$, the integrated gradients along dimension $i$ for input $x$ and baseline $x'$ is defined to be:

$$IG_i(x) \stackrel{def}{=} (x_i - x_i') \times \int_{\alpha = 0}^1 \frac{\partial f(x' + \alpha \times (x - x'))}{\partial x_i} d\alpha. \tag{2}$$

The computed value $IG_i(x)$ determines relatively how important the $i$th input (e.g., pixel) was to the classification.

However, exactly computing this integral requires a symbolic, closed form for the gradient of the network. Until [50], it was not known how to compute such a closed form, and so IGs were always only approximated using a sampling-based approach. Unfortunately, because it was unknown how to compute the true value, there was no way for practitioners to determine how accurate their approximations were. This is particularly concerning in fairness applications where an accurate attribution is exceedingly important.

In [50], it was recognized that, when $X = \mathrm{ConvexHull}(\{x, x'\})$, $\widehat{f_{\restriction X}}$ can be used to exactly compute $IG_i(x)$. This is because within each partition of $\widehat{f_{\restriction X}}$ the network behaves as a linear function, so its gradient is constant, and hence the integral can be written as the weighted sum of such finitely-many gradients.<sup>1</sup> Using our symbolic representation, the exact IG can thus be computed as follows:

$$\sum_{\mathrm{ConvexHull}(\{y, y'\}) \in \widehat{f_{\restriction \mathrm{ConvexHull}(\{x, x'\})}}} (y_i - y_i') \times \frac{\partial f(0.5 \times (y + y'))}{\partial x_i} \tag{3}$$

Here $y$ and $y'$ are the endpoints of each segment in the partitioning, with $y$ the endpoint closer to $x$ and $y'$ the endpoint closer to $x'$, so that each term carries the same sign convention as Equation (2).
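The segment-wise sum can be illustrated on a toy network. The following is a minimal sketch, assuming a hand-written one-hidden-layer ReLU network with scalar output; it does not use SyReNN or its client library, and all function names (`net`, `gradient`, `breakpoints`, `exact_ig`) and the example weights are hypothetical. It finds the points on the segment from the baseline to the input where some ReLU changes sign, then sums the constant per-segment gradients weighted by the segment extents.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def preactivations(W1, b1, x):
    return [dot(row, x) + b for row, b in zip(W1, b1)]

def net(W1, b1, w2, x):
    """Toy network f(x) = w2 . relu(W1 x + b1) with a scalar output."""
    return sum(w * max(z, 0.0) for w, z in zip(w2, preactivations(W1, b1, x)))

def gradient(W1, b1, w2, x):
    """Gradient of f at x; it is constant inside each linear region."""
    z = preactivations(W1, b1, x)
    return [sum(w2[k] * W1[k][i] for k in range(len(W1)) if z[k] > 0.0)
            for i in range(len(x))]

def breakpoints(W1, b1, x, baseline):
    """Sorted alphas in [0, 1] at which some ReLU changes sign on the segment."""
    z0 = preactivations(W1, b1, baseline)
    z1 = preactivations(W1, b1, x)
    alphas = {0.0, 1.0}
    for a, b in zip(z0, z1):
        if a != b:  # each pre-activation is affine in alpha: a + alpha * (b - a)
            alpha = a / (a - b)
            if 0.0 < alpha < 1.0:
                alphas.add(alpha)
    return sorted(alphas)

def exact_ig(W1, b1, w2, x, baseline):
    """Sum (y_i - y'_i) * df/dx_i over the linear segments between breakpoints."""
    def point(alpha):
        return [p + alpha * (q - p) for p, q in zip(baseline, x)]
    alphas = breakpoints(W1, b1, x, baseline)
    ig = [0.0] * len(x)
    for lo, hi in zip(alphas, alphas[1:]):
        y_prime, y = point(lo), point(hi)  # y' nearer the baseline, y nearer x
        mid = [0.5 * (u + v) for u, v in zip(y, y_prime)]
        g = gradient(W1, b1, w2, mid)
        for i in range(len(x)):
            ig[i] += (y[i] - y_prime[i]) * g[i]
    return ig

# Hypothetical example weights, input, and baseline.
W1 = [[1.0, -1.0], [0.5, 1.0]]
b1 = [0.0, -1.0]
w2 = [1.0, 2.0]
x, baseline = [2.0, 1.0], [0.0, 0.0]
ig = exact_ig(W1, b1, w2, x, baseline)
```

A useful sanity check is the completeness property of IG: the attributions should sum to `net(W1, b1, w2, x) - net(W1, b1, w2, baseline)`, which holds here because the integral is evaluated exactly rather than sampled.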

*Implementation.* The helper class IntegratedGradientsHelper is provided by our Python client library. It takes as input a DNN $f$ and a set of $(x, x')$ input-baseline pairs and then computes IG for each pair.

*Empirical Results.* In [50] SyReNN was used to show conclusively that existing sampling-based methods were insufficient to adequately approximate the true IG. This realization led to changes in the official IG implementation to use the more-precise trapezoidal sampling method we argued for.

*Timing Numbers.* In those experiments, we used SyReNN to compute $\widehat{f_{\restriction X}}$ for three different DNNs $f$, namely the small, medium, and large convolutional models from [1]. For each DNN, we ran SyReNN on 100 one-dimensional lines. The 100 calls to SyReNN completed in 20.8 seconds for the small model, 183.3 seconds for the medium model, and 615.5 seconds for the large model. Tests were performed on an Intel Core i7-7820X CPU at 3.60GHz with 32GB of memory.

#### **6.2 Visualization of DNN Decision Boundaries**

Whereas IG helps understand why a DNN made a particular prediction about a single input point, another major task is visualizing the decision boundaries of a DNN on infinitely-many input points. Figure 2 shows a visualization of an ACAS Xu DNN [31] which takes as input the position of an airplane and an approaching attacker, then produces as output one of five advisories instructing the plane, such as "clear of conflict" or to move "weak left." Every point in the diagram represents the relative position of the approaching plane, while the color indicates the advisory.

<sup>1</sup> As noted in [50], this technically requires a slight strengthening of the definition of $\widehat{f_{\restriction X}}$ which is satisfied by our algorithms as defined above.

Fig. 2: Visualization of decision boundaries for the ACAS Xu network. Using SyReNN (left) quickly produces the exact decision boundaries. Abstract interpretation-based tools like DeepPoly (middle and right) are slower and produce only imprecise approximations of the decision boundaries.

One approach to such visualizations is to simply sample finitely-many points and extrapolate the behavior on the entire domain from those finitely-many points. However, this approach is imprecise and risks missing vital information because there is no way to know the correct sampling density to use to identify all important features.

Another approach is to use a tool such as DeepPoly [49] to over-approximate the output range of the DNN. However, because DeepPoly is a relatively coarse over-approximation, there may be regions of the input space for which it cannot state with confidence the decision made by the network. In fact, the approximations used by DeepPoly are extremely coarse. A naïve application of DeepPoly to this problem results in it being unable to make claims about any of the input space of interest. In order to utilize it, we must partition the space and run DeepPoly within each partition, which significantly slows down the analysis. Even when using 25<sup>2</sup> partitions, Figure 2b shows that most of the interesting region is still unclassifiable with DeepPoly (shown in white). Only with 100<sup>2</sup> partitions can DeepPoly effectively approximate the decision boundaries, although it is still quite imprecise.

By contrast, $\widehat{f_{\restriction X}}$ can be used to exactly determine the decision boundaries on any 2D polytope subset of the input space, which can then be plotted. This is shown in Figure 2a. Furthermore, as shown in Table 1, the approach using $\widehat{f_{\restriction X}}$ is significantly faster than that using ERAN, even though we get the precise answer instead of an approximation. Such visualizations can be particularly helpful in identifying issues to be fixed using techniques such as those in Section 6.3.

Table 1: Comparing the performance of DNN visualization using SyReNN versus DeepPoly for the ACAS Xu network [31]. $\widehat{f_{\restriction X}}$ size is the number of partitions in the symbolic representation. SyReNN time is the time taken to compute $\widehat{f_{\restriction X}}$ using SyReNN. DeepPoly[k] time is the time taken to compute DeepPoly for approximating decision boundaries with k partitions. Each scenario represents a different two-dimensional slice of the input space; within each slice, the heading of the intruder relative to the ownship along with the speed of each involved plane is fixed.


*Implementation.* The helper class PlanesClassifier is provided by our Python client library. It takes as input a DNN f and an input region X, then computes the decision boundaries of f on X.

*Timing Numbers.* Timing comparisons are given in Table 1. We see that SyReNN is quite performant, and the exact SyReNN can be computed more quickly than even a mediocre approximation from DeepPoly using 55<sup>2</sup> partitions. Tests were performed on a dedicated Amazon EC2 c5.metal instance, using BenchExec [5] to limit the number of CPU cores to 16 and RAM to 16GB.

#### **6.3 Patching of DNNs**

We have now seen how SyReNN can be used to visualize the behavior of a DNN. This can be particularly useful for identifying buggy behavior. For example, in Figure 2a we can see that the decision boundary between "strong right" and "strong left" is not symmetrical.

The final application we consider for SyReNN is patching DNNs to correct undesired behavior. Patching is described formally in [51]. Given an initial network $N$ and a specification $\varphi$ describing desired constraints on the input/output, the goal of patching is to find a small modification to the parameters of $N$ producing a new DNN $N'$ that satisfies the constraints in $\varphi$.

The key theory behind DNN patching we will use was developed in [51]. The key realization of that work is that, for a certain DNN architecture, correcting the network behavior on an infinite, 2D region $X$ is exactly equivalent to correcting its behavior on the finitely-many vertices $\mathrm{Vert}(P_i)$ for each of the finitely-many $P_i \in \widehat{f_{\restriction X}}$. Hence, SyReNN plays a key role in enabling efficient DNN patching.

Fig. 3: Network patching.

For this case study, we patched the same aircraft collision-avoidance DNN visualized in Section 6.2. We patched the DNN three times to correct three different buggy behaviors of the network: (i) remove "Pockets" of strong left/strong right in regions that are otherwise weak left/weak right; (ii) remove the "Bands" of weak-left advisory behind and to the left of the plane; and (iii) enforce "Symmetry" across the horizontal. The DNNs before and after patching with different specifications are shown in Figure 3.

*Implementation.* The helper class NetPatcher is provided by our Python client library. It takes as input a DNN $f$ and pairs $(X_i, Y_i)$ of input region and output label, then computes a new DNN $f'$ which maps all points in each $X_i$ into $Y_i$.

*Timing Numbers.* As in Section 6.2, computing $\widehat{f_{\restriction X}}$ for use in patching took approximately 10 seconds.

#### **7 Related Work**

The related problem of exact reach set analysis for DNNs was investigated in [58]. However, the authors use an algorithm that relies on explicitly enumerating all exponentially-many (2<sup>n</sup>) possible signs at each ReLU layer. By contrast, our algorithm adapts to the actual input polytopes, efficiently restricting its consideration to activations that are actually possible.

Hanin and Rolnick [25] prove theoretical properties about the cardinality of $\widehat{f_{\restriction X}}$ for ReLU networks, showing that $|\widehat{f_{\restriction X}}|$ is expected to grow polynomially with the number of nodes in the network for randomly-initialized networks.

Thrun [55] and Bastani et al. [4] extract symbolic rules meant to approximate DNNs, which can approximate the symbolic representation $\widehat{f_{\restriction X}}$.

In particular, the ERAN [1] tool and underlying DeepPoly [49] domain were designed to verify the non-existence of adversarial examples. Breutel et al. [6] give an iterative refinement algorithm for an overapproximation of the weakest precondition as a polytope where the required output is also a polytope.

Scheibler et al. [46] verify the safety of a machine-learning controller using the SMT-solver iSAT3, but support small unrolling depths and basic safety properties. Zhu et al. [60] use a synthesis procedure to generate a safe deterministic program that can enforce safety conditions by monitoring the deployed DNN and preventing potentially unsafe actions. The presence of adversarial and fooling inputs for DNNs as well as applications of DNNs in safety-critical systems has led to efforts to verify and certify DNNs [3,32,14,29,16,7,57,49,2]. Approximate reachability analysis for neural networks safely overapproximates the set of possible outputs [16,58,59,57,13,56].

Prior work in the area of network patching focuses on enforcing constraints on the network during training. DiffAI [39] is an approach to train neural networks that are certifiably robust to adversarial perturbations. DL2 [15] allows for training and querying neural networks with logical constraints.

#### **8 Conclusion and Future Work**

We presented SyReNN, a tool for understanding and analyzing DNNs. Given a piecewise-linear network and a low-dimensional polytope subspace of the input space, SyReNN computes a symbolic representation that decomposes the behavior of the DNN into finitely-many linear functions. We showed how to efficiently compute this representation, and presented the design of the corresponding tool. We illustrated the utility of SyReNN on three applications: computing exact IG, visualizing the behavior of DNNs, and patching (repairing) DNNs.

In contrast to prior work, SyReNN explores a unique point in the design space of DNN analysis tools. Instead of trading off precision of the analysis for efficiency, SyReNN focuses on analyzing DNN behavior on low-dimensional subspaces of the domain, for which we can provide both efficiency and precision.

We plan on extending SyReNN to make use of GPUs and other massively-parallel hardware to more quickly compute $\widehat{f_{\restriction X}}$ for large $f$ or $X$. Techniques to support input polytopes of more than two dimensions are also a ripe area for future work. We may also be able to take advantage of the fact that non-convex polytopes can be represented efficiently in 2D. Extending algorithms for $\widehat{f_{\restriction X}}$ to handle architectures such as Recurrent Neural Networks (RNNs) will open up new application areas for SyReNN.

Acknowledgements. We thank the anonymous reviewers for their feedback and suggestions on this work. This material is based upon work supported by a Facebook Probability and Programming award.

#### **References**

1. ETH robustness analyzer for neural networks (ERAN). https://github.com/eth-sri/eran (2019), accessed: 2019-05-01


M.H.S., Boca, S.M., Swamidass, S.J., Huang, A., Gitter, A., Greene, C.S.: Opportunities and obstacles for deep learning in biology and medicine. Journal of The Royal Society Interface **15**(141), 20170387 (2018). https://doi.org/10.1098/rsif.2017.0387


Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States. pp. 1106–1114 (2012), http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks



#### **MachSMT: A Machine Learning-based Algorithm Selector for SMT Solvers**<sup>*</sup>

Joseph Scott<sup>1</sup> (✉), Aina Niemetz<sup>2</sup>, Mathias Preiner<sup>2</sup>, Saeed Nejati<sup>1</sup>, and Vijay Ganesh<sup>1</sup>

<sup>1</sup> University of Waterloo, Waterloo, Ontario, Canada {joseph.scott, snejati, vijay.ganesh}@uwaterloo.ca <sup>2</sup> Stanford University, Stanford, USA {niemetz,preiner}@cs.stanford.edu

**Abstract.** In this paper, we present MachSMT, an algorithm selection tool for Satisfiability Modulo Theories (SMT) solvers. MachSMT supports the entirety of the SMT-LIB language. It employs machine learning (ML) methods to construct both empirical hardness models (EHMs) and pairwise ranking comparators (PWCs) over state-of-the-art SMT solvers. Given an SMT formula I as input, MachSMT leverages these learnt models to output a ranking of solvers based on predicted run time on the formula I. We evaluate MachSMT on the solvers, benchmarks, and data obtained from SMT-COMP 2019 and 2020. We observe MachSMT frequently improves on competition winners, winning 54 divisions outright and up to a 198.4% improvement in PAR-2 score, notably in logics that have broad applications (e.g., BV, LIA, NRA, etc.) in verification, program analysis, and software engineering. The MachSMT tool is designed to be easily tuned and extended to any suitable solver application by users. MachSMT is not a replacement for SMT solvers by any means. Instead, it is a tool that enables users to leverage the collective strength of the diverse set of algorithms implemented as part of these sophisticated solvers.

**Keywords:** SMT Solvers · Machine Learning · Algorithm Selection

#### **1 Introduction**

Satisfiability Modulo Theories (SMT) solvers are tools to decide the satisfiability of formulas over first-order theories such as bit-vectors, floating-point arithmetic, integers, reals, strings, arrays, and their combinations [44,9,24,18,47,20,46]. In recent years, SMT solvers have had a revolutionary impact on applications in software engineering (broadly construed), such as software testing [17,48] and verification [23,15,27,39], as well as in sub-fields of AI [53,35,30]. This impact is a driver for an insatiable demand for ever more efficient solvers, not only to scale to larger instances obtained from existing applications (e.g., automatic bug-finding in commercial software [26,4]), but also to solve problems from new application domains (e.g., verification and synthesis of cryptographic primitives [13]).

<sup>*</sup> This work was supported in part by DARPA (award no. FA8650-18-2-7861) and ONR (award no. N68335-17-C-0558).

© The Author(s) 2021

J. F. Groote and K. G. Larsen (Eds.): TACAS 2021, LNCS 12652, pp. 303–325, 2021. https://doi.org/10.1007/978-3-030-72013-1_16

**Motivation for Algorithm Selection for SMT Solvers.** In response to this high demand, the SMT community has developed a plethora of solver heuristics and configurations. For example, in the 2019 edition of the annual SMT-COMP competition [10,31], more than 50 solvers and their configurations were submitted. Many of these solvers implement very different algorithms to tackle the satisfiability problem for (a combination of) first-order theories, with significantly varying performance profiles. For example, in the quantifier-free theory of floating-point arithmetic (QF_FP), there exist several substantially different decision procedures, e.g., bit-blasting [16], abstract CDCL [14], interreduction methods [55], and reduction to global optimization [22,11]. In this specific setting of floating-point solvers, input instances may be derived from a variety of applications, such as software verification or analysis of machine learning (ML) models [56]. In such a scenario, a very natural question arises: which solver or configuration is best for a given input instance?

Another well-known issue with many SMT solvers (even state-of-the-art ones) is that users may not know a priori which formula features or encoding would make an instance easy to solve. This can be very frustrating for users as they have to try a large number of different encodings with different solver configurations before they can figure out which combination works best for their specific scenario, which may result in a combinatorial explosion. Users have also noted that as their applications change, what was once a great solver configuration in an earlier setting is suddenly not very good in the newer one. One possible approach to address this problem is to use a portfolio of solvers, just as has been successfully done in the context of SAT solvers. Unfortunately, given the plethora of solvers (more than 50 in SMT-COMP 2019 and 2020) and configurations (CVC4 [9] alone utilizes 23 different configurations in a sequential portfolio setting for quantified logics) such an approach becomes quickly infeasible in the SMT solver setting.

**Brief Overview of MachSMT.** One way to address the above-mentioned problems is to use an automated algorithm-selection tool that can automatically and with high accuracy predict the best algorithm from a given set of algorithms for a specific input. Such a tool selects the best SMT solver from a set of solvers for a given SMT formula. To this end, we introduce MachSMT, a machine learning-based algorithm-selection tool. MachSMT supports the entirety of the SMT-LIB language [8]. It takes as input an instance for a specified theory of interest, and outputs a ranking of solvers by predicted runtime, lowest first. Internally, MachSMT is a set of machine-learnt models constructed by analyzing the runtimes of solver configurations on benchmarks with respect to the frequencies of grammatical constructs (e.g., predicates, functions, rounding modes, etc.). Additionally, it defines other syntactical properties that can influence performance (e.g., quantifier nesting levels).

At a high level, MachSMT works as follows. At its core, MachSMT uses two techniques to perform algorithm selection: empirical hardness models (EHMs) and pairwise ranking comparators (PWCs). As features, MachSMT uses frequencies of grammatical constructs from the SMT-LIB language [8], in addition to several other syntactical metrics. These features are pipelined with Principal Component Analysis (PCA) and AdaBoost to construct its empirical hardness models and comparators.

An EHM for a given solver $S$ is a mapping from an input instance $I$ to a predicted runtime of $S$ on $I$. At runtime, given $I$, MachSMT queries all EHMs for all solvers (that were considered during training) over $I$, and outputs a ranking of solvers based on their predicted runtimes (the top-ranked solver is predicted to solve the input problem the fastest). By contrast, a learnt pairwise ranking comparator (PWC) is a mapping that takes as input a pair $(S_1, S_2)$ of solvers and an input instance $I$, and outputs a ranking over the input solvers based on which one of them is predicted to have a lower runtime on $I$ (denoted as $S_1 \leq S_2$ or $S_1 \geq S_2$). During evaluation, given an input instance $I$, MachSMT uses the learnt PWC as a comparator to rank the set of solvers.

While algorithm selection has been considered in the broad setting of solvers (e.g., QBF solvers [50] and SAT solvers [67]) as well as certain specific SMT theories [57,5,64], we are not aware of previous work on algorithm selection aimed at the entirety of SMT-LIB [7]. Our results demonstrate that the MachSMT algorithm selector is highly effective, in that it outperforms the competition winners on the majority of tracks from the SMT-COMP in 2019 and 2020.

Perhaps the first algorithm selection tool in the context of logic solvers was SATZilla [67]. Since its introduction, SATZilla has had a tremendous impact on SAT solver research, winning multiple gold medals in the SAT competitions. Having said that, there are several significant differences between MachSMT and SATZilla. Briefly, SATZilla deploys a feature selection scheme to avoid the curse of dimensionality, while MachSMT leverages a learnt dimensionality reduction scheme, namely, Principal Component Analysis (PCA). In fact, a feature selection scheme would simply not scale in the context of SMT solvers given the very large number of learnt models that are incorporated into MachSMT. We discuss additional differences between SATZilla and MachSMT at length in Section 6.

It goes without saying that MachSMT is only as powerful as the underlying solvers that it has access to. MachSMT is clearly not a replacement for any particular SMT solver, but rather a tool that enables users to leverage the collective strength of the diverse set of algorithms and configurations implemented as part of these sophisticated solvers.

#### **Contributions.**

We make the following contributions in this paper.

1. **The MachSMT Algorithm Selection Tool**. We present the MachSMT tool, an algorithm selection tool for the entirety of SMT-LIB. MachSMT uses machine learning (ML) to construct EHMs and PWCs of solvers for algorithm selection. A key feature of the MachSMT tool is that it is designed to be easily tuned and extended by SMT solver users (Section 3).

2. **Analysis of MachSMT over SMT-COMP 2019 and 2020 Benchmarks and Solvers**. We perform an extensive experimental analysis of MachSMT across all divisions from SMT-COMP 2019 and 2020. We observe that MachSMT improves on competition winners in 54 divisions, with up to 198.4% improvement in performance for the QF_BVFPLRA SQ '20 division and up to 191.1% for the QF_BVFP SQ '20 division. We provide our learnt models, used in our experimentation, for ease of use and transparency. While building learnt models for MachSMT can be computationally expensive (a one-time cost), downloading, installing, and using our models is easy (Section 4). All source code and learnt models from our experiments can be found at: https://github.com/j29scott/MachSMT. The artifact is available at: https://zenodo.org/record/4458699.

The rest of this paper is structured as follows. Section 2 provides the necessary background, Section 3 gives a technical description of MachSMT, Section 4 gives an experimental evaluation of MachSMT over SMT-COMP 2019 and 2020, Section 5 provides an analysis of the experimental results, Section 6 describes related work, and Section 7 concludes the paper and discusses future work.

### **2 Background**

In this section, we provide some background on algorithm selection via EHMs and PWCs, and the machine learning methods we use, such as principal component analysis (PCA) and k-fold cross validation.

#### **2.1 A Brief Overview of Algorithm Selection**

The idea of algorithm selection was first proposed and formalized by Rice [51] in 1976. Researchers have long known that given a set of different algorithms and implementations for the same specification or problem, it is often the case that one of these implementations may perform poorly on a given class of inputs while another might perform very well. This is especially true for problems believed to be computationally hard (e.g., NP-hard). The reasons for this phenomenon could be as diverse as choice of data structures, fundamental differences between algorithms, or the fact that heuristics implemented as part of one algorithm can exploit the input problem structure or the underlying hardware better than the others.

It is natural to want to exploit the diversity in algorithmic approaches to minimize the cumulative runtimes. In practice, however, users often deploy greedy algorithm selection, that is, picking the single algorithm observed to perform best in empirical analysis and testing. Greedy algorithm selection can be sub-optimal when the empirically best algorithm has deficiencies relative to other algorithms on certain families of inputs.

With the recent advances in AI and ML, researchers are beginning to leverage these new technologies to advance algorithm selection. To the best of our knowledge, there are two key approaches for ML-driven algorithm selection in the context of constraint solvers: through the use of Empirical Hardness Models (EHMs), and through Pairwise Ranking Comparators (PWCs).

**Algorithm Selection via Empirical Hardness Models (EHMs):** Let $I$ be an input in the language of $\mathcal{S}$ with a corresponding feature vector $x \in \mathbb{R}^n$. For an algorithm $s \in \mathcal{S}$, an EHM is a learnt function $f_s : \mathbb{R}^n \to \mathbb{R}$ that predicts the runtime of $s$ on $I$. An EHM is constructed with an ML regression model trained on collected runtime data. The algorithm is then selected by computing:

$$\underset{s \in \mathcal{S}}{\text{argmin }} \; f_s(x)$$
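EHM-based selection can be sketched in a few lines. This is an illustrative sketch only: the solver names and the hand-written runtime predictors are hypothetical stand-ins for trained regression models, and `rank_by_ehm` is not part of MachSMT's interface.

```python
def rank_by_ehm(ehms, features):
    """Return solvers ordered by predicted runtime, fastest first.

    `ehms` maps a solver name to a callable that predicts a runtime
    (in seconds) from a feature vector.
    """
    predicted = {solver: model(features) for solver, model in ehms.items()}
    return sorted(predicted, key=predicted.get)

# Hypothetical EHMs: simple functions standing in for learnt regressors.
ehms = {
    "solver_a": lambda x: 2.0 + 0.5 * x[0],
    "solver_b": lambda x: 0.5 + 2.0 * x[0],
    "solver_c": lambda x: 5.0,
}

ranking_small = rank_by_ehm(ehms, [0.2])  # predictions: a=2.1, b=0.9, c=5.0
ranking_large = rank_by_ehm(ehms, [3.0])  # predictions: a=3.5, b=6.5, c=5.0
selected = ranking_large[0]               # the argmin over predicted runtimes
```

The first element of the ranking is the argmin; returning the full ordering also gives a fallback sequence if the top-ranked solver fails.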

**Algorithm Selection via Pairwise Ranking Comparators (PWCs).** Let $P$ be the set of all unique pairs of algorithms in $\mathcal{S}$ (sets of size two). For each $p = (S_i, S_j) \in P$, construct a learnt comparator $f_p : \mathbb{R}^n \to \{0, 1\}$ that returns 0 if algorithm $S_i$ is predicted to solve $I$ faster than $S_j$, and 1 otherwise. For an input $I$ with a feature vector $x$, we compute a ranking of algorithms as a map $r$ over $\mathcal{S}$, where for $s \in \mathcal{S}$, $r[s]$ is the rank of solver $s$, representing "how many solvers in $\mathcal{S}$ are faster than $s$ in solving the input $I$", or more formally: $r[s] = \sum_{p \,:\, s \in p} f_p(x)$, where each pair $p$ containing $s$ is oriented so that $s$ is its first element. The selected solver is then the minimum ranked solver, i.e.,

$$\underset{s \in \mathcal{S}}{\text{argmin }} \; r[s]$$
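The pairwise ranking can be sketched as follows. This is a minimal illustration, not MachSMT's implementation: the comparator is a hypothetical stand-in that consults a fixed table of runtimes instead of a learnt classifier, and the solver names are invented.

```python
from itertools import combinations

def rank_by_pwc(solvers, comparator, features):
    """r[s] = number of solvers predicted to be faster than s.

    `comparator(s_i, s_j, features)` returns 0 if s_i is predicted to be
    faster than s_j and 1 otherwise, mirroring the convention in the text.
    """
    rank = {s: 0 for s in solvers}
    for s_i, s_j in combinations(solvers, 2):
        if comparator(s_i, s_j, features) == 0:
            rank[s_j] += 1  # s_i predicted faster, so s_j loses this pair
        else:
            rank[s_i] += 1  # otherwise s_i loses this pair
    return rank

# Hypothetical stand-in for a learnt comparator: a fixed runtime table.
runtimes = {"solver_a": 3.0, "solver_b": 1.0, "solver_c": 2.0}

def toy_comparator(s_i, s_j, features):
    return 0 if runtimes[s_i] < runtimes[s_j] else 1

rank = rank_by_pwc(list(runtimes), toy_comparator, features=[0.0])
selected = min(rank, key=rank.get)  # the minimum ranked solver
```

Each unordered pair is queried once, so ranking m solvers takes m(m-1)/2 comparator calls. Learnt comparators need not be transitive, so ties in `rank` are possible and `min` then simply keeps the first solver encountered.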

#### **2.2 Supervised Learning, Adaptive Boosting, Curse of Dimensionality, and K-Fold Cross-Validation**

Supervised learning is one of the most predominant areas of ML. Supervised learning takes as input a dataset of features $X$ and labels $Y$, where each datapoint $x \in X$ has a label $y \in Y$. A datapoint is a real-valued vector $x \in \mathbb{R}^n$ describing a sample. The learning problem is said to be a classification problem if the labels $y \in Y$ come from a fixed and finite set of classes $C$ (e.g., a set of algorithms). Alternatively, the learning problem is a regression problem if the labels are real valued (e.g., runtimes).

One efficient and effective approach to supervised learning is Adaptive Boosting (AdaBoost). AdaBoost is an ensemble approach to machine learning invented by Freund and Schapire [21], for which they won the Gödel Prize in 2003. In ensemble learning, a set of learners (e.g., weak learners) is trained, and predictions are made democratically, by a weighted vote across the set. In this paper, we exclusively consider AdaBoost to solve both the classification and regression problems for algorithm selection. We use an ensemble of 200 decision trees in the AdaBoost algorithm. For more details, we refer to Drucker et al. [19].

While supervised learning has had tremendous impacts in several areas of research, there are pitfalls, such as the curse of dimensionality (CoD). Consider the convex polytope $P$ formed around the convex hull of $X$. The volume of $P$ increases exponentially with the dimensionality of $X$, requiring an exponential number of datapoints to avoid extreme sparsity in $X$. Sparsity in datasets is one of the leading causes of poor performance in learnt models [28]. There is a large literature on managing the CoD. In this paper, we discuss feature selection and deploy dimensionality reduction solutions. In feature selection, a new dataset $X'$ is computed from $X$ by selecting the subset of features that are the most performant on a validation dataset. Feature selection was deployed in the successful SATZilla algorithm selection tool for Boolean satisfiability.

Despite the success of feature selection in SATZilla, feature selection does have some flaws. First, there is a significant loss of information. In the case of SATZilla, a feature vector composed of more than a hundred values describing an input is reduced to just five values. Second, the total number of feature subsets is exponential in the number of features. While there has been a great deal of research in reducing the time spent searching for high performing subsets [65,36], in our experiments, we found it to be the most computationally taxing component of the SATZilla framework.

When evaluating the performance of a supervised learning model, a training set is used to construct the learnt model and a testing set is set aside to evaluate it. However, this method alone can be prone to overfitting and selection bias [54,43]. Instead, researchers often use k-fold cross-validation to evaluate their learnt models. In k-fold cross-validation, the dataset is split into k sets, and the learnt model is trained on k − 1 sets and is evaluated on the set that is left out. This process is repeated k times so each set gets evaluated.
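The splitting scheme can be sketched as below. This is a plain-Python illustration with a hypothetical helper name (`k_fold_splits`); it uses contiguous, unshuffled folds for readability, whereas practical setups usually shuffle the data first.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Distribute the remainder so fold sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_splits(10, 3))
for train, test in folds:
    # A model would be fit on `train` and scored on `test` here.
    pass
```

Every sample appears in exactly one test fold, so each datapoint is evaluated once by a model that never saw it during training.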

#### **2.3 Unsupervised Learning and Principal Component Analysis**

Unsupervised learning, in contrast to supervised learning, is the study of detecting patterns in an unlabelled dataset $X$. Applications of unsupervised learning include dimensionality reduction [66,63], clustering [29,72], and anomaly detection [38,1]. Principal Component Analysis (PCA) is an unsupervised learning dimensionality reduction technique. PCA computes an orthogonal transformation of a dataset $X$ composed of points in $\mathbb{R}^n$ to a new dataset $X'$ composed of points in $\mathbb{R}^{n'}$ where $n' < n$. PCA is an incremental algorithm wherein, at each iteration, a new component (or dimension) is computed. On the first iteration, a hyperplane is fit around the dataset $X$ and its corresponding spanning vector is the first element of the basis of the transformation onto $X'$. On each subsequent iteration, a new hyperplane is computed under the additional constraint of it being orthogonal to its predecessors. This process is repeated until the desired number of iterations is achieved [32,66].
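As a small illustration of the first iteration, the sketch below computes the first principal component of a toy dataset by power iteration on the sample covariance matrix. This is one standard way to obtain the leading component, chosen here for brevity; it is not claimed to be how any particular library implements PCA. The helper names are hypothetical, and the sketch assumes the starting vector is not orthogonal to the leading component and that the data has non-zero variance.

```python
import math

def first_principal_component(points, iterations=200):
    """Return (mean, unit vector) for the direction of greatest variance."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d  # assumed not orthogonal to the leading eigenvector
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v

def project(points, mean, component):
    """Coordinates of the centered points along one component."""
    return [sum((p[j] - mean[j]) * component[j] for j in range(len(mean)))
            for p in points]

# Toy data lying exactly on the line y = 2x.
points = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
mean, component = first_principal_component(points)
reduced = project(points, mean, component)  # 2D points reduced to 1D
```

Further components could be obtained by removing the variance explained by the components already found and repeating, which enforces the orthogonality constraint described above.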

#### **3 An overview of MachSMT**

In this section, we provide an overview of the MachSMT tool. The architecture diagram of MachSMT is presented in Figure 1.


Table 1: Complete list of the 162 features used in MachSMT

**Fig. 1** Architecture of MachSMT.

#### **3.1 Features, Preprocessing, and Learning**

MachSMT uses a feature vector with 162 entries (i.e., dimensions). A complete description of each feature is provided in Table 1. We deploy two strategies to mitigate taxing feature calculation times, which can severely impair algorithm selection solutions. First, all features are entirely syntactical properties of the input. This is a major difference between MachSMT and other algorithm selection solutions, such as SATZilla. Second, all features are calculated within a strict and user-adjustable timeout (default 10s). On a timeout, the feature value is recorded as −1.0.
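A syntactic frequency feature with a time budget might look like the sketch below. It is an assumption-laden illustration rather than MachSMT's feature extractor: the tokenizer is a crude regular expression, the function name and symbol list are hypothetical, and on a timeout it marks every feature in the vector as −1.0, which is a simplification of the per-feature behavior described above.

```python
import re
import time
from collections import Counter

def construct_frequencies(smt2_text, symbols, timeout_s=10.0):
    """Count occurrences of each symbol, or return -1.0s if the budget expires."""
    deadline = time.monotonic() + timeout_s
    counts = Counter()
    # Crude tokenizer: maximal runs of characters other than whitespace/parens.
    for i, token in enumerate(re.findall(r"[^\s()]+", smt2_text)):
        if i % 1024 == 0 and time.monotonic() > deadline:
            return [-1.0] * len(symbols)  # simplification: mark all features
        counts[token] += 1
    return [float(counts[s]) for s in symbols]

formula = "(assert (= (bvadd x (bvadd y z)) (bvmul x y)))"
symbols = ["assert", "bvadd", "bvmul", "bvudiv"]
features = construct_frequencies(formula, symbols)
timed_out = construct_frequencies(formula, symbols, timeout_s=-1.0)
```

Because the pass only scans tokens and never invokes a solver, its cost is linear in the size of the input file, which is what keeps feature computation cheap relative to solving.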

MachSMT performs three key preprocessing steps before constructing any learnt models over a given dataset. We describe each subsequently. First, all feature values are scaled to zero mean and unit variance<sup>3</sup>. This data normalization technique is common in ML research and applications to improve both model efficiency and numerical robustness. The second step in the preprocessing pipeline is computing the polynomial interaction terms of degree two on the resultant normalized feature vector. These polynomial features make interacting correlations of features explicit. These first two preprocessing steps are included in the SATZilla preprocessing pipeline [71].
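The first two preprocessing steps can be sketched in plain Python. The helper names are hypothetical, and the sketch assumes that the degree-two expansion keeps the original features and adds all squares and pairwise products; the text does not specify whether squares are included, so this is one plausible reading.

```python
import math
from itertools import combinations_with_replacement

def fit_scaler(rows):
    """Per-feature mean and (population) standard deviation on the training set."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
            for j in range(d)]
    return means, stds

def standardize(row, means, stds):
    """Scale to zero mean and unit variance; constant features map to 0.0."""
    return [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]

def degree_two_terms(row):
    """Original features followed by all products x_i * x_j with i <= j."""
    pairs = combinations_with_replacement(range(len(row)), 2)
    return list(row) + [row[i] * row[j] for i, j in pairs]

train = [[1.0, 10.0], [3.0, 30.0]]
means, stds = fit_scaler(train)
scaled = [standardize(r, means, stds) for r in train]
expanded = [degree_two_terms(r) for r in scaled]
```

Under this assumption a vector of d features expands to d + d(d+1)/2 terms, so 162 features would become 13,365 terms, which is the kind of growth that motivates a subsequent dimensionality reduction step.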

As discussed in Section 2, ML in a high dimensional space is prone to the curse of dimensionality. While other algorithm selection solutions (e.g., SATZilla) commonly implement feature selection, we propose the use of learnt dimensionality reduction, namely PCA. As discussed above, feature selection can be a proactive solution to the curse of dimensionality but presents many challenges when applied to SMT. Internally, MachSMT manages more than a thousand learnt models, and calculating optimal feature subsets for each one is infeasible.

<sup>3</sup> $\frac{x-\mu}{\sigma}$, where x is a feature sample, μ is the mean of the specific feature on the training set, and σ is the standard deviation of the specific feature on the training set.

The third and final preprocessing step is applying PCA on the resultant polynomial features. The final feature vector is composed of the first 35 principal components. PCA is the final step in the MachSMT preprocessing pipeline. The resultant feature set is used when constructing the learnt models with AdaBoost. We use AdaBoost for both regression in the EHMs and classifications in the PWCs. We configure AdaBoost with 200 decision tree estimators and linear loss. MachSMT uses scikit-learn and numpy as its ML backend and the entire tool is written in Python [49]. MachSMT is easily extensible and supports any ML model/pipeline under scikit-learn syntax.

#### **3.2 Variants of MachSMT**

MachSMT implements the following algorithm selection solutions.


MachSMT by default creates models for all aforementioned approaches to algorithm selection. In evaluation, MachSMT evaluates each approach's performance on each logic. In deployment, MachSMT uses the approach that had the best-observed performance in evaluation.

#### **3.3 Using MachSMT**

MachSMT consists of three core tools, which are used to build, evaluate, and deploy MachSMT, respectively.

1. machsmt build – This tool is the interface for building MachSMT's database around the solvers and benchmarks provided by the user. It takes as input a CSV file with the columns 'solver', 'benchmark', and 'score'. The output is a library directory containing the resultant database and the learnt models under default settings.

```
machsmt build -f data.csv -l /path/to/lib/dir
```


Table 2: Selected results of MachSMT on data from SMT-COMP 2019 and 2020. All numbers are percent differences of PAR-2 scores across all benchmarks. Columns 3 and 4 show the improvement over random selection and competition winners (higher is better). Column 5 shows the PAR-2 difference to the VBS (lower is better).

2. machsmt eval – This tool takes as input the library directory generated by machsmt build and evaluates it under k-fold cross validation and provides a summary of results. It further tunes MachSMT to use the best empirically observed variant based on the logic and track of the input benchmark.

`machsmt eval -l /path/to/lib/dir`

3. machsmt – This tool is the primary interface to MachSMT's algorithm selection. Provided an input benchmark and its library files, it will output a ranking of solvers that are predicted to solve the benchmark the fastest.

`machsmt benchmark.smt2 -l /path/to/lib/dir`
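As an illustration of the input expected by machsmt build, the following sketch writes a CSV file with the three columns named above. The solver names, benchmark paths, and scores are made up, and the column order as well as the exact meaning of 'score' are assumptions that should be checked against the tool's documentation.

```python
import csv
import io


def write_machsmt_csv(rows, handle):
    """Write (solver, benchmark, score) triples with the expected header."""
    writer = csv.writer(handle)
    writer.writerow(["solver", "benchmark", "score"])
    for solver, benchmark, score in rows:
        writer.writerow([solver, benchmark, score])


# Hypothetical example data: two solvers on one benchmark.
example_rows = [
    ("solverA", "benchmarks/QF_BV/a.smt2", 12.7),
    ("solverB", "benchmarks/QF_BV/a.smt2", 2400.0),
]

buffer = io.StringIO()
write_machsmt_csv(example_rows, buffer)
```

Replacing the in-memory buffer by a file opened with `open("data.csv", "w", newline="")` yields a file that can be passed to the `-f` option shown above.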

**Fig. 2** Plot for BV in the Single Query (SQ) Track in SMT-COMP '19.

#### **3.4 User-defined Features**

We include a simple interface for users to extend the features considered in MachSMT's algorithm selection. All that is required is a Python method that returns a single floating-point number (or an iterable object thereof) representing the feature. As input, the method receives the path of the SMT-LIB input, as well as its logic and track. If a user feature is to be considered by MachSMT for a given input, the user-defined procedure should return its floating-point representation; otherwise, it returns None. All user-defined features are automatically included when building MachSMT. These custom features can in principle significantly affect the accuracy of MachSMT when engineered to target a specific class of benchmarks.
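A user-defined feature of the kind described above might look like the following sketch. The function name, the parameter names, and the restriction to quantifier-free logics are hypothetical; how such a function is registered with MachSMT is not shown here and should be taken from the tool's documentation.

```python
def count_assertions(benchmark_path, logic, track):
    """Hypothetical feature: number of '(assert' occurrences in the input.

    Returns None when the feature should not be considered, here (as an
    assumption for illustration) for logics that are not quantifier-free.
    """
    if not logic.startswith("QF_"):
        return None
    with open(benchmark_path, encoding="utf-8", errors="replace") as handle:
        return float(handle.read().count("(assert"))
```

The feature is purely syntactic, so it can be computed quickly, in line with the design of the built-in features.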

#### **4 Experimental Evaluation of MachSMT on SMT-COMP 2019 and 2020 Data**

In this section, we present the evaluation of our MachSMT tool (refer to Table 2 and CDF plots in Figures 2–6), specifically with the benchmarks, solvers, and solver runtime analysis from SMT-COMP 2019 and 2020. The artifact is available at: https://zenodo.org/record/4458699.

**Fig. 3** Plot for NRA in the Single Query (SQ) Track in SMT-COMP '19.

#### **4.1 Experimental Setup and Methodology**

In this experiment, we used the benchmarks, timing analysis, and solvers provided by the organizers of the SMT-COMP 2019 and 2020 competitions [31,6]. In both years, all solver input queries were performed on the StarExec computing service [58], which consists of a cluster of 2.4 GHz Intel Xeon machines running Red Hat Enterprise Linux 7.2. Each solver/benchmark pair was configured to have 4 cores and 60GB of memory available. The time limit for each pair was 2400 seconds in 2019, and 1200 seconds in 2020.

We evaluate MachSMT and all of its variants using k-fold cross validation (with k = 5). In cross validation, the dataset is randomly partitioned into k subsets per division. A model is then trained over k − 1 subsets and makes predictions over the subset that is excluded from training. This process is repeated to obtain fair predictions for each subset. Cross validation is commonly deployed to analyze machine learning models. For more details, please see Section 2.
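The cross-validation scheme just described can be sketched as follows. This is a generic illustration with hypothetical names, not the evaluation code of MachSMT; the training and prediction steps are left to the caller.

```python
import random


def k_fold_splits(items, k=5, seed=0):
    """Yield (train, test) pairs; every item is held out exactly once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled items round-robin into k folds of near-equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [x for i, fold in enumerate(folds) if i != held_out
                 for x in fold]
        yield train, test
```

A model would be trained on each `train` list and asked for predictions on the matching `test` list, so that every benchmark receives a prediction from a model that has not seen it.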

#### **4.2 Experimental Results**

**Fig. 4** Division QF_BVFPLRA in the Single Query Track in SMT-COMP 2020.

For every division, we evaluated MachSMT by checking whether we beat the competition winner from each division. For the sequential tracks, we evaluate solvers according to PAR-2 scores (i.e., the wallclock runtime on successful termination, otherwise twice the wallclock timeout)<sup>4</sup> [42]. For incremental tracks, we use the following formula:

$$w + \frac{2t}{n} \cdot (n - m)$$

where w is the wall clock runtime, t is the wallclock timeout, n is the total number of check-sats in the benchmark, and m is the total number of check-sats successfully solved.
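The two scoring rules can be written down directly. The sketch below uses hypothetical function names and follows the definitions given in the text, including the penalty of ten times the timeout for incorrect answers.

```python
def par2_score(runtime, timeout, solved, correct=True):
    """PAR-2: runtime if solved, otherwise twice the wallclock timeout."""
    if solved and not correct:
        return 10.0 * timeout  # incorrect answers are penalised heavily
    return runtime if solved else 2.0 * timeout


def incremental_score(w, t, n, m):
    """Score for incremental tracks: w + (2t / n) * (n - m).

    w: wallclock runtime, t: wallclock timeout,
    n: number of check-sats in the benchmark, m: number solved.
    """
    return w + (2.0 * t / n) * (n - m)
```

For example, a benchmark with 10 check-sats of which 4 are solved before a 1200 second timeout receives 1200 + 240 · 6 = 2640, whereas solving all 10 in 100 seconds yields a score of 100.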

We present select results in Table 2. We consider three baselines when evaluating MachSMT, namely: random algorithm selection, the competition winner, and the virtual best solver (VBS) (note that the VBS corresponds to perfect algorithm selection and cannot be beaten). We consider all divisions with at least 25 benchmarks and observe that MachSMT improves on the competition winner in 54 out of 85 divisions. We report the results for MachSMT-SolverLogicEHM in the table as it is by far the most performant, dominating in all divisions except for four.

We present select CDF plots in Figures 2-6. A CDF plot is a visualization of how a solver performs on a database of inputs. A point (X,Y) denotes that a solver S solves Y inputs within X seconds each.
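The points of such a CDF plot, and the runtimes of the virtual best solver, can be derived from raw runtimes as in the following sketch. The representation of unsolved benchmarks as `None` and all names are assumptions made for illustration.

```python
def cdf_points(runtimes, timeout):
    """Return points (X, Y): Y inputs are each solved within X seconds."""
    solved = sorted(t for t in runtimes if t is not None and t <= timeout)
    return [(t, count) for count, t in enumerate(solved, start=1)]


def virtual_best(runtimes_by_solver):
    """Per-benchmark minimum runtime over all solvers (None if unsolved).

    runtimes_by_solver maps a solver name to a list of runtimes, where
    position i in every list refers to the same benchmark.
    """
    best = []
    for column in zip(*runtimes_by_solver.values()):
        finished = [t for t in column if t is not None]
        best.append(min(finished) if finished else None)
    return best
```

Plotting `cdf_points(virtual_best(data), timeout)` next to the curve of each individual solver shows the headroom available to any algorithm selector.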

<sup>4</sup> In case of an incorrect answer, the score is recorded as 10 times the wallclock timeout.

**Fig. 5** Division QF_LIA in the Single Query Track in SMT-COMP 2020.

#### **5 Analysis and Discussion of Results**

In Section 3.2, we describe four formulations of MachSMT. In our evaluation (see Table 2), we observe MachSMT-SolverLogicEHM to be significantly more performant than all other formulations. When evaluating over SMT-COMP, in all divisions that MachSMT improved over the competition winner, MachSMT-SolverLogicEHM was the most performant in all except for three (which were won by MachSMT-SolverLogicPWC).

Our experimental results validate the idea that algorithm selection (in particular through the use of EHMs) can be a powerful way to address the combinatorial explosion that solver users face when trying to decide which solver-configuration pair is best suited for their application. We note that MachSMT is particularly powerful in the context of logics, such as QF_UFBV, that are derived from a diverse set of applications and for which a wide variety of algorithms have been designed. As has been noted in previous work, algorithm selection methods work well for non-homogeneous benchmarks, especially where there is no single algorithm (solver) that performs the best across the board. EHMs are an effective way to distinguish between such algorithms given a problem instance and predict which one might perform the best on said instance.

One major threat to the validity of any ML solution is the generalizability of the learnt models on unseen data. It has been noted in previous work that a practical way to address this issue is to use a k-fold cross validation scheme [54,43], thus motivating our use of this approach in our experiments. We further note that our evaluation of MachSMT includes decades of runtime analysis and more than 100 GB of benchmarks spanning numerous applications, giving us greater confidence in the robustness of our results.

**Fig. 6** Division QF_UFBV in the Single Query Track in SMT-COMP 2020.

#### **6 Related Work**

In this section we provide an overview of previous work on algorithm selection in the context of constraint solvers and contrast it with MachSMT.

#### **6.1 Key differences between SATZilla and MachSMT**

As mentioned above, SATZilla was the first algorithm selection method in the context of logic solvers [67]. While our work is inspired by SATZilla, MachSMT differs from SATZilla in several key ways. First, SATZilla deploys a feature selection scheme to avoid the curse of dimensionality. While good in practice for the SAT setting, feature selection does lose significant amounts of information. Further, it can be very expensive to compute optimal feature subsets.

By contrast, MachSMT leverages a learnt dimensionality reduction scheme, namely, Principal Component Analysis (PCA). The key advantage of PCA is that it does not perform a search for an optimal feature subset (as one has to do in the context of feature selection), and hence is significantly more efficient. In fact, a feature selection method is unlikely to scale for SMT solvers, unlike SAT, simply because of the significantly larger number of features, logics, and solvers that one has to contend with. Second, MachSMT deploys a modern ML pipeline, including an ensemble learning approach, namely Adaptive Boosting [21].

#### **6.2 Algorithm Selection for Logic Solvers and Their Applications**

Algorithm selection tools have a rich history and have been around since at least 1976 when Rice et al. were the first to propose it [51]. Algorithm selectors have been extensively used in many contexts, e.g., classifiers for machine learning [2], combinatorics [37], and other NP-hard optimization problems [60,62]. Within the context of solvers, algorithm selectors have been proposed for QBF [50,41], SAT [67,68,69], CSP solvers [25,3,34], and recommenders for ATP tools [59,61].

In the setting of SMT solver applications, symbolic execution tools have used algorithm selection strategies [64] and portfolio strategies [33] for the specific classes of instances within the context of the bit-vector theory. This would be an ideal use case of MachSMT, since we provide a more complete solution.

There have been other works using machine learning to improve the performance of SMT solvers. Balunovic et al. [5] use neural networks and synthesis to find tactics and strategies for three SMT-LIB theories. A previous version of our work proposed an algorithm selection tool for the QF_FP theory [57]. To the best of our knowledge, MachSMT is the first publicly available tool for the entirety of SMT-LIB. Other works have leveraged machine learning to improve internal heuristics in solvers [12,52,40].

Pairwise ranking has been used in algorithm selection in the latest versions of SATZilla [70], as well as in other settings such as variable selection in the context of splitting heuristics in divide-and-conquer parallel SAT solvers [45].

#### **7 Conclusions and Future Work**

In this paper, we presented MachSMT, the first algorithm selection tool that spans the entirety of the SMT-LIB logics. MachSMT is designed to be user-friendly and easily modifiable by users for their specific application and SMT solvers of interest.

Using MachSMT, we observe improvement in 54 out of 85 divisions in all tracks from SMT-COMP 2019 and 2020, with up to a 198.4% improvement in PAR-2 score for the QF_BVFPLRA SQ '20 division. Most of the logics on which we do not see improvement are ones for which we have very few benchmarks.

For future work, we plan to extend our scoring scheme to take into account model validation and unsat core divisions. We further plan to extend our feature set with more (theory-)specific features based on feedback from the SMT community. It is very likely that users may have domain-specific knowledge about certain features that might be most predictive of solver runtime for their particular application. Hence, we have provided an interface to easily extend and specialize MachSMT to a user's specific setting.

#### **References**


Everest: Towards a verified, drop-in replacement of HTTPS. In: Lerner, B.S., Bodík, R., Krishnamurthi, S. (eds.) 2nd Summit on Advances in Programming Languages, SNAPL 2017, May 7-10, 2017, Asilomar, CA, USA. LIPIcs, vol. 71, pp. 1:1–1:12. Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2017). https://doi.org/10.4230/LIPIcs.SNAPL.2017.1


72. Xu, R., II, D.C.W.: Survey of clustering algorithms. IEEE Trans. Neural Networks **16**(3), 645–678 (2005). https://doi.org/10.1109/TNN.2005.845141


#### **dtControl 2.0: Explainable Strategy Representation via Decision Tree Learning Steered by Experts**<sup>⋆</sup>

Pranav Ashok<sup>1</sup>, Mathias Jackermeier<sup>1</sup>, Jan Křetínský<sup>1</sup>, Christoph Weinhuber<sup>1</sup>(✉), Maximilian Weininger<sup>1</sup>, and Mayank Yadav<sup>2</sup>

> <sup>1</sup> Technical University of Munich, Munich, Germany firstname.lastname@tum.de <sup>2</sup> Department of Computer Science and Engineering, I.I.T. Delhi, New Delhi, India cs1180356@iitd.ac.in

**Abstract.** Recent advances have shown how decision trees are apt data structures for concisely representing strategies (or controllers) satisfying various objectives. Moreover, they also make the strategy more explainable. The recent tool dtControl provided pipelines with tools supporting strategy synthesis for hybrid systems, such as SCOTS and Uppaal Stratego. We present dtControl 2.0, a new version with several fundamentally novel features. Most importantly, the user can now provide domain knowledge to be exploited in the decision tree learning process and can also interactively steer the process based on the dynamically provided information. To this end, we also provide a graphical user interface. It allows for inspection and re-computation of parts of the result, suggesting as well as receiving advice on predicates, and visual simulation of the decision-making process. Besides, we interface model checkers of probabilistic systems, namely STORM and PRISM, and provide dedicated support for categorical enumeration-type state variables. Consequently, the controllers are more explainable and smaller.

**Keywords:** Strategy representation · Controller representation · Decision Tree · Explainable Learning · Hybrid systems · Probabilistic Model Checking · Markov Decision Process

#### **1 Introduction**

A controller (also known as strategy, policy or scheduler) of a system assigns to each state of the system a set of actions that should be taken in order to achieve a certain goal. For example, one may want to satisfy a given specification of a robot's behaviour or exhibit a concurrency bug appearing only in some interleaving. It is desirable that the controllers possess several additional properties, besides achieving the goal, in order to be usable in practice. Firstly, controllers should be explainable. Only then can they be understood, trusted and implemented by the engineers, certified by the authorities, or used in the debugging process [11]. Secondly, they should be small in size and efficient to run. Only then can they be deployed on embedded devices with limited memory of a few kilobytes, while the automatically synthesized ones are orders of magnitude larger [49]. Thirdly, whenever the primary goal, e.g. functional correctness, is accompanied by a secondary criterion, e.g. energy efficiency, they should be performant with respect to this criterion.

<sup>⋆</sup> This work has been partially supported by the German Research Foundation (DFG) project No. 383882557 SUV (KR 4890/2-1), No. 427755713 GOPro (KR 4890/3-1) and the TUM International Graduate School of Science and Engineering (IGSSE) grant 10.06 PARSEC. We thank Tim Quatman for implementing JSON-export of strategies in STORM and Pushpak Jagtap for his support with the SCOTS models.

© The Author(s) 2021

J. F. Groote and K. G. Larsen (Eds.): TACAS 2021, LNCS 12652, pp. 326–345, 2021. https://doi.org/10.1007/978-3-030-72013-1_17

Automatic controller synthesis is able to provide controllers for a given goal in various domains, such as probabilistic systems [32, 17], hybrid systems [45, 16, 30, 19] or reactive systems [35]. In some cases, even the performance can be reflected [16]. However, despite recent interest in explainability in connection to AI-based controllers [2] and despite typically small memories of embedded devices, automatic techniques for controller synthesis mostly fall short of producing small explainable results. A typical outcome is a controller in the form of a look-up table, listing the actions for each possible state, or a binary decision diagram (BDD) [14] representation thereof. While the latter reduces the size to some extent, neither of the two representations is explainable: the former due to its size, the latter due to the bit-level representation with all high-level structure lost. Instead, learning representations in the form of decision trees (DT) [38] has been recently explored to this end [7, 3]. DTs turn out to be usually smaller than BDDs, do not descend to the bit level, and are generally well known for their interpretability and explainability due to their simple structure. However, despite showing significant potential, the state-of-the-art tool dtControl [4] uses predicates without natural interpretation, and moreover, the best size reductions are achieved using determinization, i.e. making the controller less permissive, which negatively affects performance [7].

Example 1 (Motivating example). Consider the cruise control model of [34], where we want to control the speed of our car so that it never crashes into the car in front while, as a secondary performance objective, keeping the distance between the two cars small.

A safe controller for this model, as returned by Uppaal Stratego, is a lookup table of size 418 MB with 300,000 lines. The respective BDD has 1,448 nodes with all information bit-blasted. Using adaptations of standard DT-construction algorithms, as implemented in dtControl, we can get a DT with 987 nodes, which is still too large to be explained. Using determinization techniques, the controller can be compressed to 3 nodes! However, the DT then only allows decelerating until the minimum velocity is reached. This is safe, as we cannot crash into the car in front, but it does not even attempt to get close to the front car, and thus has very bad performance.

One can find a strategy with optimal performance, retaining the maximal permissiveness, not determinizing at all, which can be represented by a DT with 11 nodes. A picture of this DT as well as reasoning how to derive the predicates from the kinematic equations is in the extended version of this paper [5, Appendix A].

However, exactly because the predicates are based on the domain knowledge, namely the kinematic equations, they take the form of algebraic predicates and not simply linear predicates, which are the only ones in dtControl and commonly in the machine-learning literature on DTs.

This motivating example shows that using domain knowledge and algebraic predicates, available now in dtControl 2.0, one can get a smaller representation than when using existing heuristics. Further, it improves the performance of the DT, and it is easily explainable, as it is based on domain knowledge. In fact, the discussed controller is so explainable that it allowed us to find a bug in the original model. In general, using dtControl 2.0, a domain expert can try to compress the controller, and thus gain more insight and validate that it is correct. Another example of this has been reported from the use of dtControl in the manufacturing domain [31].

While automatic synthesis of good predicates from the domain knowledge may seem as distant as automatic synthesis of program invariants or automatic theorem provers, we adopt the philosophy of those domains and offer semi-automatic techniques.

Additionally, if not performance but only safety of a controller is relevant, we can still benefit from determinization without drawbacks. To this end, we also provide a new determinization procedure that generalizes the extremely successful MaxFreq technique of [4] and is as good or better on all our examples.

To incorporate the changes just discussed, namely algebraic predicates, a semi-automatic approach, and better determinization, we have also reworked the tool and its interfaces. To begin with, the software architecture of dtControl 2.0 is now very modular and allows for easy further modifications, as well as adding support for new synthesis tools. In fact, we have already added parsers for the tools STORM [17] and PRISM [32], and thus we support probabilistic models as well. Since these models also contain categorical (or enumeration-type) variables, e.g. protocol states, we have also added support for categorical predicates. Furthermore, we added a graphical user interface that not only is easier to use than the command-line interface, but also makes it possible to inspect the DT, modify and retrain parts of it, and simulate runs of the model under its control, further increasing the possibilities to explain the DT and validate the controller.

Summing up, the main improvements of dtControl 2.0 over the previous version [4] are the following:


The paper is structured as follows. After recalling necessary background in Section 2, we give an overview of the improvements over the previous version of the tool from the global perspective in Section 3. We detail the algorithmic contributions in Sections 4 (predicate domains), 5 (predicate selection) and 6 (determinization). Section 7 provides experimental evaluation and Section 8 concludes.

**Related work.** DTs have been suggested for representing controllers of and counterexamples in probabilistic systems in [11]; however, the authors only discuss approximate representations. The ideas have been extended to other settings, such as reactive synthesis [12] and hybrid systems [7]. More general linear predicates have been considered in leaves of the trees in [3]. dtControl 2.0 contains the DT induction algorithms from [7, 3]. The differences to the previous version of the tool dtControl [4] are summarized above and schematically depicted in Figure 2.

Besides, DTs have been used to represent and learn strategies for safety objectives in [40] and to learn program invariants in [21]. Further, DTs were used for representing the strategies during the model checking process, namely in strategy iteration [10] or in simulation-based algorithms [42]. Representing controllers exactly using a structure similar to DT (mistakenly claimed to be an algebraic decision diagram) was first suggested by [22], however, no automatic construction algorithm was provided.

The idea of non-linear predicates has been explored in [28]. In that work, however, it is not based on domain knowledge, but rather on projecting the state-space to higher dimensions.

BDDs [14] have been commonly used to represent strategies in planning [15], symbolic model checking [32] as well as to represent hybrid system controllers [45, 30]. While BDDs [14] operate only on Boolean variables, they have the advantage of being diagrams and not trees. Moreover, they correspond to Boolean functions that can be implemented on hardware easily. [18] proposes an automatic compression technique for numerical controllers using BDDs. Similar to our work, [49] considers the problem of obtaining concise BDD representations of controllers and presents a technique to obtain smaller BDDs via determinization. However, BDDs are difficult to explain due to variables being bit-blasted, and their size is very sensitive to the chosen variable ordering. An extension of BDDs, algebraic or multi-terminal decision diagrams (ADD/MTBDD) [8, 20], have been used in reinforcement learning for strategy synthesis [26, 47]. ADDs extend BDDs with the possibility to have multiple values in the terminal nodes, but the predicates still work only on Boolean variables, retaining the disadvantages of BDDs.

#### **2 Decision tree learning for controller representation**

In this section, we briefly describe how controllers can be represented as decision trees as in [4]. We give an exemplified overview of the method, pinpointing the role of our algorithmic contributions.

A (non-deterministic, also called permissive) controller is a map C : S → 2<sup>A</sup> from states to non-empty sets of actions. This notion of a controller is fairly general; the only requirement is that it has to be memoryless and non-randomized. This kind of controller is optimal for many tasks such as expected (discounted) reward, reachability or parity objectives. Moreover, even finite-memory controllers can be written in this form by considering the product of the state space with the finite memory as the domain, for example, as in LTL model checking.

Decision trees (DT), e.g. [38], are trees where every leaf node is labelled with a non-empty set of actions and every inner node is labelled with a predicate ρ : S → {true, false}.

Fig. 1: An example controller based on the cruise-control model in the form of a lookup table (left), and the corresponding decision tree (right).

Example 2 (Decision tree representation). As an example, consider the controller given in Figure 1a. It is a subset of the real cruise-control case study from the motivating Example 1. A state is a 3-tuple of the variables v<sub>o</sub>, v<sub>f</sub> and d, which denote the velocity of our car, the velocity of the front car and the distance between the cars, respectively. In each state, our car may be allowed to perform a subset of the following set of actions: decelerate (dec), stay in neutral (neu) or accelerate (acc). A DT representing this lookup table is depicted in Figure 1b.

Given a state, for example v<sub>o</sub> = v<sub>f</sub> = 4, d = 10, the DT is evaluated as follows: We start at the root and, since it is an inner node, we evaluate its predicate v<sub>o</sub> > 0. As this is true, we follow the true branch and reach the inner node labelled with the predicate v<sub>f</sub> > 4. This is false, so we follow the false branch and reach the leaf node labelled {dec, neu}. Hence, we know that the controller allows both decelerating and staying neutral in this state.
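The evaluation walk-through above can be mirrored in a few lines of Python. The sketch encodes only the path described in the text (root predicate v<sub>o</sub> > 0, then v<sub>f</sub> > 4, then the leaf {dec, neu}); the remaining branches of the figure are not described in the text and are therefore replaced by an explicitly marked placeholder. Class and variable names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Union


@dataclass(frozen=True)
class Leaf:
    actions: FrozenSet[str]


@dataclass(frozen=True)
class Inner:
    predicate: Callable[[dict], bool]
    if_true: "Tree"
    if_false: "Tree"


Tree = Union[Leaf, Inner]

# Placeholder for subtrees that the walk-through does not describe.
UNKNOWN = Leaf(frozenset({"<not described in the text>"}))

tree = Inner(
    predicate=lambda s: s["v_o"] > 0,
    if_true=Inner(
        predicate=lambda s: s["v_f"] > 4,
        if_true=UNKNOWN,
        if_false=Leaf(frozenset({"dec", "neu"})),
    ),
    if_false=UNKNOWN,
)


def evaluate(node, state):
    """Follow true/false branches until a leaf is reached."""
    while isinstance(node, Inner):
        node = node.if_true if node.predicate(state) else node.if_false
    return node.actions
```

Evaluating the state v<sub>o</sub> = v<sub>f</sub> = 4, d = 10 visits exactly the two inner nodes named in the example and ends in the leaf labelled {dec, neu}.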

To construct a DT representation of a given controller, the following recursive algorithm may be used. Note that it is heuristic since constructing an optimal binary decision tree is an NP-complete problem [27].

Base case: If all states in the controller agree on their set of actions B (i.e. for all states s we have C(s) = B), return a leaf node with label B.

Recursive case: Otherwise, we split the controller. For this, we select a predicate ρ and construct an inner node with label ρ. Then we partition the controller by evaluating the predicate on the state space, and recursively construct one DT for the sub-controller on states {s ∈ S | ρ(s)} where the predicate is true, and one for the sub-controller where it is false. These controllers are the children of the inner node with label ρ and we proceed recursively.

For selecting the predicate, we consider two hyper-parameters: the domain of the predicates (see Section 4) and the way to select predicates (see Section 5). The selection is typically performed by choosing the predicate with the lowest impurity; this is a measure of how homogeneous (or "pure") the controller is after the split, in other words the degree to which all the states agree on their actions.
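The recursive construction with impurity-based predicate selection can be sketched as follows. The sketch makes several assumptions that go beyond the text: states are tuples of numbers, the predicate domain consists of axis-aligned predicates of the form s[i] ≤ c, and impurity is the weighted entropy of the action sets after the split. It is meant as an illustration of the scheme, not as the implementation in dtControl.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of a list of labels; 0 when all labels agree."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def candidate_predicates(states):
    """Yield (dimension, threshold) pairs between consecutive values."""
    for i in range(len(states[0])):
        values = sorted({s[i] for s in states})
        for low, high in zip(values, values[1:]):
            yield i, (low + high) / 2


def build(controller):
    """controller: dict mapping state tuples to frozensets of actions."""
    states = list(controller)
    labels = [controller[s] for s in states]
    if len(set(labels)) == 1:  # base case: all states agree
        return ("leaf", labels[0])
    best = None
    for i, c in candidate_predicates(states):
        true_part = {s: a for s, a in controller.items() if s[i] <= c}
        false_part = {s: a for s, a in controller.items() if s[i] > c}
        impurity = (len(true_part) * entropy(list(true_part.values()))
                    + len(false_part) * entropy(list(false_part.values()))
                    ) / len(states)
        if best is None or impurity < best[0]:
            best = (impurity, i, c, true_part, false_part)
    _, i, c, true_part, false_part = best
    return ("node", i, c, build(true_part), build(false_part))


def lookup(tree, state):
    """Evaluate the tree built by build() on a state."""
    while tree[0] == "node":
        _, i, c, if_true, if_false = tree
        tree = if_true if state[i] <= c else if_false
    return tree[1]
```

Because every split separates at least one state from the rest, the recursion terminates, and the resulting tree reproduces the input controller exactly on all of its states.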

We also consider a third hyper-parameter of the algorithm, namely determinization by safe early stopping (see Section 6). This modifies the base case as follows: if all states in the controller agree on at least one action a (i.e. for all states s we have a ∈ C(s)), then we return a leaf node with label {a}. This variant of early stopping ensures that, even though the controller is not represented exactly, still for every state a safe action is allowed.
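The modified base case for safe early stopping amounts to intersecting the action sets of all remaining states. In the sketch below, the function names are hypothetical and the choice among several common actions (here the alphabetically smallest) is an arbitrary assumption made only to keep the example deterministic.

```python
def common_actions(controller):
    """Actions allowed in every state of the (sub-)controller."""
    action_sets = iter(controller.values())
    common = set(next(action_sets))
    for actions in action_sets:
        common &= actions
    return common


def early_stopping_leaf(controller):
    """Return a singleton leaf label if some action is safe everywhere.

    Returns None when no single action is shared by all states, in which
    case the construction has to split further.
    """
    common = common_actions(controller)
    if common:
        return frozenset({min(common)})  # arbitrary but deterministic pick
    return None
```

Since the chosen action is contained in C(s) for every state s of the sub-controller, the resulting leaf never allows an action that the original controller forbids.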

Hence, if the original controller satisfies some property, e.g. that a safe set of states is never left, the DT construction algorithm ensures that this property is retained. This is because our algorithm represents the strategy exactly (or a safe subset, in case of determinization) and does not generalize as DTs typically do in machine learning. DTs are suitable for both tasks, as both rely on the strength of DTs exploiting underlying structure.

Remark 1. Note that for some types of objectives such as reachability, determinization of permissive strategies might lead to a violation of the original guarantees. For example, consider a strategy that allows both a self-looping and a non-self-looping action at a particular state. If the determinizer decides to restrict to the self-looping action, the reachability property may be violated in the determinized strategy. However, this problem can be addressed when synthesizing the strategy by ensuring that every action makes progress towards the target.

#### **3 Tool**

dtControl 2.0 is an easy-to-use open-source tool for representing memoryless symbolic controllers as more compact and more interpretable DTs, while retaining safety guarantees of the original controllers. Our website dtcontrol.model.in.tum.de offers hyperlinks to the easy-to-install pip package<sup>3</sup>, the documentation and the source code. Additionally, the artifact that has passed the TACAS 21 artifact evaluation is available here [6].

<sup>3</sup> pip is a standard package-management system used to install and manage software packages written in Python.

Fig. 2: An overview of the components of dtControl 2.0, thereby showing software architecture and workflow. Contributions of this paper are highlighted in red.

The schema in Figure 2 illustrates the workflow of using dtControl, highlighting new features in red. Considering dtControl as a black box, it shows that given a controller, it returns a DT representing the controller and also offers the possibility to simulate a run of the system under the control of the DT, visualizing the decisions made. The controller can be input in various formats, including the newly supported strategy representations of the well-known probabilistic model checkers PRISM [32] and STORM [17]. The DT is output in several machine-readable formats, and as C-code that can be directly used for executing the controller on embedded devices. Note that this C-code consists only of nested if-else-statements. The new graphical user interface also offers the possibility to inspect the graph in an interactive web user interface, which even allows editing the DT. This means that parts of the DT can be retrained with a different set of hyper-parameters and directly replaced. This way, one can for example first train a determinized DT and then retrain important parts of it to be more permissive and hence more performant for a secondary criterion. Figure 3 shows a screenshot of the newly integrated graphical user interface.

Looking at the inner workings of dtControl, we see the three important hyper-parameters that were already introduced in Section 2: predicate domain, predicate selector, and determinizer. For each of these, dtControl offers various choices, some of which were newly added for version 2.0. Most prominently, the user now has the possibility to directly influence both the predicate domain and the predicate selector, by providing domain knowledge and thus also additional predicates, or by directly using the interactive predicate selection. More details on the predicate domain and how domain knowledge is specified can be found in Section 4. The different ways to select predicates, especially the new interactive mode, are the topic of Section 5. Our new insights into determinization are described in Section 6. To support the user in finding a good set of hyper-parameters, dtControl also offers extensive benchmarking functionality, allowing the user to specify multiple variants and reporting several statistics.

Fig. 3: Screenshot of the new web-based graphical user interface. It offers a sidebar for easy selection of the controller file and hyper-parameters, an experiments table where benchmarks can be queued, and a results table in which some statistics of the run are provided. Moreover, users can click on the 'eye' icon in the results table to inspect the built decision tree.

**Technical notes.** dtControl 2.0 is written in Python 3 following an architecture closely resembling the schema in Figure 2. The modularity, along with our technical documentation, allows users to easily extend the tool. For example, supporting another input format is only a matter of adding a parser.

dtControl 2.0 works with Python version 3.7.9 or higher. The core of the tool, which runs the learning algorithms, requires numpy [23], pandas [36] and scikit-learn [41], and optionally the library for the heuristic OC1 [39]. The algebraic predicates rely on SymPy [37] and SciPy [48]. The web user interface is powered by Flask [1] and D3.js [9].

#### **4 Predicate domain**

The domain of the predicates that we allow in the inner nodes of the DT is of key importance. As we saw in the motivating Example 1, allowing for more expressive predicates can dramatically reduce the size of the DT.

We assume that our state space is structured, i.e. it is a Cartesian product of the domains of the variables (S = S<sub>1</sub> × ... × S<sub>n</sub>). We use s<sub>i</sub> to refer to the i-th state-variable of a state s ∈ S. In Example 2, the three state-variables are the velocity of our car, the velocity of the front car, and the distance.

We first give an overview of the predicate domains dtControl 2.0 supports, before discussing the details of the new ones.

Axis-aligned predicates [38] have the form s<sub>i</sub> ≤ c, where c is a rational constant. This is the simplest form of predicate, and it has the advantage that there are only finitely many such predicates, as the domain of every state-variable is bounded. However, they are also the least expressive.

Linear predicates (also known as oblique [39]) have the form Σ<sub>i</sub> s<sub>i</sub> · a<sub>i</sub> ≤ c, where a<sub>i</sub> are rational coefficients and c is a rational constant. They have the advantage that they are able to combine several state-variables, which can lead to saving linearly many splits, cf. [29, Fig. 5.2]. The disadvantage of these predicates is that there are infinitely many choices of coefficients, which is why heuristics were introduced to determine a good set of predicates to try out [39, 4]. However, heuristically determined coefficients and combinations of variables can impede explainability.

Algebraic predicates have the form f(s) ≤ c, where f is any mathematical function over the state-variables and c is a rational constant. It can use elementary functions such as exponentiation, log, or even trigonometric functions. Example 1 illustrated how this can reduce the size and improve explainability. More discussion of these predicates follows in Section 4.2.

Categorical predicates are special predicates for categorical (enumeration-type) state-variables such as colour or protocol state, and they are discussed in Section 4.1.
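To make the four predicate domains concrete, the following Python sketch encodes each of them as a small factory function. The function names, the concrete constants and the example expression used for the algebraic predicate are illustrative assumptions and are not taken from dtControl.

```python
from typing import Callable, List, Sequence, Set


def axis_aligned(i: int, c: float) -> Callable:
    """Predicate s_i <= c on a single state-variable."""
    return lambda s: s[i] <= c


def linear(a: Sequence[float], c: float) -> Callable:
    """Predicate sum_i s_i * a_i <= c combining several state-variables."""
    return lambda s: sum(s_i * a_i for s_i, a_i in zip(s, a)) <= c


def algebraic(f: Callable, c: float) -> Callable:
    """Predicate f(s) <= c for an arbitrary function f of the state."""
    return lambda s: f(s) <= c


def categorical(i: int, groups: List[Set[str]]) -> Callable:
    """Possibly non-binary split: returns the index of the branch for s_i."""
    def branch(s):
        for index, group in enumerate(groups):
            if s[i] in group:
                return index
        raise ValueError(f"value {s[i]!r} not covered by any branch")
    return branch


# Assumed ordering: (own velocity, front velocity, distance).
state = (20.0, 15.0, 30.0)
slow = axis_aligned(0, 25.0)                         # 20 <= 25
closing = linear((1.0, -1.0, 0.0), 10.0)             # 20 - 15 <= 10
too_fast = algebraic(lambda s: s[0] ** 2 - s[2], 0)  # 400 - 30 <= 0 (hypothetical f)
colour_branch = categorical(0, [{"red", "green"}, {"blue"}])
```

The first three factories return Boolean tests, whereas the categorical one returns a branch index, which mirrors the fact that categorical splits need not be binary.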

#### **4.1 Categorical predicates**

Categorical state-variables do not have a numeric domain, but instead are unordered and qualitative. They commonly occur in the models coming from the tools PRISM and STORM.

Example 3. Let one state-variable be 'colour' with the domain {red, blue, green}. A simple approach is to assign numbers to every value, e.g. red = 0, blue = 1, green = 2, and treat this variable as numeric. However, a resulting predicate such as colour ≤ 2 is hardly explainable and additionally depends on the assignment of numbers. For example, it would not be possible to single out colour ∈ {red, green} using a single predicate, given the aforementioned numeric assignment. Using linear predicates, for example adding half of the colour to some other state-variable, is even more confusing and dependent on the numeric assignment.

Instead of treating the categorical variables using their numeric encodings, dtControl 2.0 supports specialized algorithms from the literature, see e.g. [43, 44]. They work by labelling an inner node with a categorical variable and performing a (possibly non-binary) split according to the value of the categorical variable. The node can have at most one child for every possible value of the categorical variable, but it can also group together similarly behaving values, see Figure 4 for an example. For the grouping, dtControl 2.0 uses the greedy algorithm from [44, Chapter 7] called attribute-value grouping. It starts with one branch for every possible value of the categorical variable, and then merges branches as long as this improves the predicate; see [5, Appendix C] for the full pseudocode of the algorithm.

In our experiments we found that the grouping algorithm sometimes did not merge branches in cases where it would actually have made the DT smaller or more explainable. This is because the resulting impurity, the goodness of a predicate, could be marginally worse due to floating-point inaccuracies. Thus, we introduce tolerance, a bias parameter in favour of larger value groups. When checking whether to merge branches, we do not require the impurity to improve, but we allow it to become worse up to our tolerance. Setting tolerance to 0 corresponds exactly to the algorithm from [44], while setting tolerance to ∞ results in merging branches until only two remain, thus producing binary predicates.
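The effect of the tolerance parameter can be illustrated with a simplified greedy grouping loop. This sketch is not the pseudocode from [5, Appendix C]: the use of weighted entropy as the impurity, the non-strict acceptance test `best <= current + tolerance` and all names are assumptions chosen to keep the example short.

```python
import math
from collections import Counter
from itertools import combinations


def weighted_entropy(groups, data):
    """Impurity of a categorical split; data is a list of (value, label) pairs."""
    total = len(data)
    result = 0.0
    for group in groups:
        labels = [label for value, label in data if value in group]
        if not labels:
            continue
        counts = Counter(labels)
        entropy = -sum((n / len(labels)) * math.log2(n / len(labels))
                       for n in counts.values())
        result += len(labels) / total * entropy
    return result


def group_values(data, tolerance=0.0):
    """Greedily merge branches while impurity worsens by at most `tolerance`."""
    groups = [frozenset({v}) for v in sorted({v for v, _ in data})]
    current = weighted_entropy(groups, data)
    while len(groups) > 2:
        best = None
        for g1, g2 in combinations(groups, 2):
            merged = [g for g in groups if g not in (g1, g2)] + [g1 | g2]
            impurity = weighted_entropy(merged, data)
            if best is None or impurity < best[0]:
                best = (impurity, merged)
        if best[0] > current + tolerance:
            break  # even the best merge is worse than the tolerance allows
        current, groups = best
    return groups


# Hypothetical data: red and green behave identically, blue differs.
colours = [("red", "a"), ("red", "a"), ("green", "a"),
           ("green", "a"), ("blue", "b"), ("blue", "b")]
grouped = group_values(colours)

# Four values with pairwise different labels: nothing merges without tolerance.
distinct = [("n", "1"), ("e", "2"), ("s", "3"), ("w", "4")]
strict = group_values(distinct, tolerance=0.0)
binary = group_values(distinct, tolerance=math.inf)
```

With an infinite tolerance the loop only stops once two branches remain, which corresponds to the binary predicates mentioned above.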

To allow dtControl 2.0 to use categorical predicates, the user has to provide a metadata file, which tells the tool which variables are categorical and which are numeric; see [5, Appendix B.1] for an example.

#### **4.2 Algebraic predicates**

It is impossible to try out every mathematical expression over the state-variables, and it would also not necessarily result in an explainable DT. Instead, we allow the user to enter domain knowledge to suggest templates of predicates that dtControl 2.0 should try. See [5, Appendix B.2] for a discussion of the format in which domain knowledge can be entered.

Providing the basic equations that govern the model behaviour can already help in finding a good predicate, and is easy to do for a domain expert. Additionally, dtControl 2.0 offers several possibilities to further exploit the provided domain knowledge:

Firstly, the given predicates need not be exact, but may contain coefficients. These coefficients can either be completely arbitrary or come from a finite set suggested by the user. For coefficients with finite domain, dtControl 2.0 tries all possibilities; for arbitrary coefficients, it uses curve fitting to find a good value. For example, the user can specify a predicate such as d + (v<sub>o</sub> − v<sub>f</sub>) · c<sub>0</sub> > c<sub>1</sub> with c<sub>0</sub> being an arbitrary rational number and c<sub>1</sub> ∈ {0, 5, 10}.

Fig. 4: Two examples of a categorical split. On the left, all possible values of the state-variable colour lead to a different child in a non-binary split. On the right, red and green lead to the same child, which is a result of grouping similar values together.

Secondly, the interactive predicate selection (see Section 5) allows the user to try out various predicates at once and observe their respective impurity in the current node. The user can then choose among them as well as iteratively suggest further predicates, inspired by those where the most promising results were observed.

Thirdly, the decisions given by a DT can be visualized in the simulator, possibly leading to a better understanding of the controller. Upon gaining any further insight, the user can directly edit any subtree of the result, possibly utilizing the interactive predicate selection again.
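As a rough illustration of how a predicate template with a finite-domain coefficient expands into concrete candidates, consider the following Python sketch. The function name, the state ordering (d, v_o, v_f) and the hand-picked value of the arbitrary coefficient are assumptions; in particular, the curve-fitting step that dtControl 2.0 uses for arbitrary coefficients is not shown here.

```python
from typing import Callable, Iterable, List, Tuple

State = Tuple[float, float, float]  # assumed ordering: (d, v_o, v_f)


def instantiate(c0: float, c1_domain: Iterable[float]
                ) -> List[Tuple[str, Callable[[State], bool]]]:
    """Expand the template d + (v_o - v_f) * c0 > c1 for every allowed c1."""
    candidates = []
    for c1 in c1_domain:
        def holds(s: State, c1: float = c1) -> bool:
            d, v_o, v_f = s
            return d + (v_o - v_f) * c0 > c1
        candidates.append((f"d + (v_o - v_f) * {c0} > {c1}", holds))
    return candidates


# c0 is fixed by hand here; it stands in for a fitted value.
candidates = instantiate(c0=2.0, c1_domain=[0, 5, 10])
state = (12.0, 10.0, 13.0)  # left-hand side evaluates to 12 + (-3) * 2 = 6
truth = [holds(state) for _, holds in candidates]
```

Each resulting candidate can then be scored by the impurity measure like any other predicate.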

#### **5 Predicate selection**

The tool offers a range of options to affect the selection of the most appropriate predicate from a given domain.

Impurity measures: As mentioned in Section 2, the predicate selection is typically based on the lowest impurity induced. The most commonly used impurity measure (and the only one the first version of dtControl supported) is Shannon's entropy [46]. In dtControl 2.0, a number of other impurity measures from the literature [43, 13, 25, 39, 3] are available. However, our results indicate that entropy typically performs the best, and therefore it is used as the default option unless the user specifies otherwise. Due to lack of space, we delegate the details and experimental comparison between the impurity measures to [5, Appendix D].

Priorities: dtControl 2.0 also has the new functionality to assign priorities to the predicate generating algorithms. Priorities are rational numbers between 0 and 1. The impurity of every predicate is divided by the priority of the algorithm that generated it. For example, a user can use axis-aligned splits with priority 1 and a linear heuristic with priority 1/2. Then the more complicated linear predicate is only chosen if it is at least twice as good (in terms of impurity) as the easier-to-understand axis-aligned split. A predicate with priority 0 is only considered after all predicates with non-zero priority have failed to split the data. This allows the user to give just a few predicates from domain knowledge, which are then strictly preferred to the automatically generated ones, but which need not suffice to construct a complete DT for the controller.

Interactive predicate selection: dtControl 2.0 offers the user the possibility to manually select the predicate in every split. This way, the user can prefer predicates that are explainable over those that optimize the impurity.

The screenshot of the interactive interface in [5, Appendix F] shows the information that dtControl 2.0 provides. The user is given some statistics and metadata, e.g. minimum, maximum and step size of the state-variables in the current node, a few automatically generated predicates for reference and all predicates generated from domain knowledge. The user can specify new predicates and is immediately informed about their impurity. Upon selecting a predicate, the split is performed and the user continues in the next node.

The user can also first construct a DT using some automatic algorithm and then restart the construction from an arbitrary node using the interactive predicate selection to handcraft an optimized representation, or at any point decide that the rest of the DT should be constructed automatically.
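The priority mechanism described in this section can be sketched as a small selection function. The `Candidate` record, its `splits` flag and the example numbers are hypothetical; only the rule itself (divide the impurity by the priority, and fall back to priority 0 when nothing else splits the data) follows the description above.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    name: str
    impurity: float  # lower is better
    priority: float  # rational number in [0, 1]
    splits: bool     # whether the predicate separates the data in this node


def select(candidates: Sequence[Candidate]) -> Optional[Candidate]:
    useful = [c for c in candidates if c.splits]
    preferred = [c for c in useful if c.priority > 0]
    if preferred:
        # Scale the impurity: a low priority makes a predicate look worse.
        return min(preferred, key=lambda c: c.impurity / c.priority)
    # Only priority-0 predicates are left; compare them by raw impurity.
    return min(useful, key=lambda c: c.impurity, default=None)


axis = Candidate("x <= 3", impurity=0.6, priority=1.0, splits=True)
linear_good = Candidate("x + 2y <= 5", impurity=0.25, priority=0.5, splits=True)
linear_weak = Candidate("x + 2y <= 5", impurity=0.35, priority=0.5, splits=True)
automatic = Candidate("auto", impurity=0.1, priority=0.0, splits=True)
```

With these numbers, the linear predicate wins only when its impurity is at most half of the axis-aligned one (0.25 / 0.5 = 0.5 < 0.6), and the priority-0 candidate is returned only if no prioritized predicate splits the data.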

#### **6 New insights about determinization**

In our context, determinization denotes a procedure that, for some or all states, picks a subset of the allowed actions. Formally, a determinization function δ transforms a controller C into a "more determinized" C′, such that for all states s ∈ C we have ∅ ⊊ C′(s) ⊆ C(s). This reduces the permissiveness, but often also reduces the size. Note that, for safety controllers, this always preserves the original guarantees of the controller. For other (non-safety) controllers, see Remark 1.

dtControl 2.0 supports three different general approaches to determinizing a controller: pre-processing, post-processing and safe early stopping. Pre-processing commits to a single determinization before constructing the DT. Post-processing prunes the DT after its construction, e.g. safe pruning in [7]. The basic idea of safe early stopping is already described in Section 2: if all states agree on at least one action, then instead of continuing to split the controller, stop early and return a leaf node with that common action. Alternatively, to preserve more permissiveness, one can return not only a single common action, but all common actions; formally, return the maximum set B such that for all states s in the node B ⊆ C(s).
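The formal condition on a determinized controller and the maximal common set B can both be written down directly. The following Python sketch uses dictionaries from states to frozensets of actions; the function names and the two-state example are assumptions for illustration only.

```python
from typing import Dict, FrozenSet

Controller = Dict[str, FrozenSet[str]]


def is_determinization(c_prime: Controller, c: Controller) -> bool:
    """Check that C'(s) is a non-empty subset of C(s) for every state s."""
    return c_prime.keys() == c.keys() and all(
        len(c_prime[s]) > 0 and c_prime[s] <= c[s] for s in c)


def max_common_actions(node: Controller) -> FrozenSet[str]:
    """Largest set B with B a subset of C(s) for all states s in the node."""
    return frozenset.intersection(*node.values())


node = {"s1": frozenset("abc"), "s2": frozenset("abd")}
b = max_common_actions(node)
leaf = {s: b for s in node}  # every state in the node is mapped to B
```

Whenever B is non-empty, mapping all states of the node to B satisfies the condition, which is why stopping early with such a leaf is safe.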

The results of [4] show that both pre-processing and post-processing are outperformed by an on-the-fly approach based on safe early stopping. This is because pre-processing discards a lot of information that could have been useful in the DT construction and post-processing can only affect the bottom-most nodes of the resulting DT, but usually not those close to the root.

We now give a new view on safe early stopping approaches for determinizing a controller that allows us to generalize the techniques of [4], reducing the size of the resulting DTs even more.

Example 4. Consider the following controller: C(s<sub>1</sub>) = {a, b, c}, C(s<sub>2</sub>) = {a, b, d}, C(s<sub>3</sub>) = {x, y}. All three states map to different sets of actions, and thus an impurity measure like entropy penalizes grouping s<sub>1</sub> and s<sub>2</sub> the same as grouping s<sub>1</sub> and s<sub>3</sub>. However, if determinization is allowed, grouping s<sub>1</sub> and s<sub>2</sub> need not be penalized at all, as these states agree on some actions, namely a and b. Grouping s<sub>1</sub> and s<sub>2</sub> into the same child node thus allows the algorithm to stop early at that point and return a leaf node with {a, b}, in contrast to grouping s<sub>1</sub> and s<sub>3</sub>.

Knowing that we want to determinize by safe early stopping affects the predicate selection process. Intuitively, sets of states are more homogeneous the more actions they share. We want to take this into account when calculating the impurity of predicates. One way to do this would be to calculate the impurity of all possible determinization functions and pick the best one. This, however, is infeasible, hence we propose the heuristic of multi-label impurity measures. These impurity measures do not only consider the full set of allowed actions in their calculation, but instead they depend on the individual actions occurring in the set. This allows the DT construction to pick better predicates, namely those whose resulting children are more likely to be determinizable. In [5, Appendix E] we formally derive the multi-label variants of entropy and Gini-index.

To conclude this section, we point out the key difference between the new approach of multi-label impurity measures and the previous idea that was introduced in [4]. The approach from [4] does not evaluate the impurity of all possible determinization functions, but rather picks a smart one – that of maximum frequency (MaxFreq) – and evaluates according to that. MaxFreq determinizes in the following way: for every state, it selects from the allowed actions that action occurring most frequently throughout the whole controller. This way, many states share common actions. This is already better than pre-processing, as it does not determinize the controller a priori, but rather considers a different determinization function at every node. However, in every node we calculate the impurity for several different predicates, and the optimal choice of determinization function depends on the predicate. Thus, choosing a single determinization function for a whole node is still too coarse, as it is fixed independent of the considered predicate. We illustrate the arising problem in the following Example 5.

Fig. 5: A simple example of a dataset that is split suboptimally by the MaxFreq approach from [4], but optimally by the new multi-label entropy approach.

Example 5. Figure 5 shows a simple controller with a two-dimensional state space. Every point is labeled with its set of allowed actions.

As c is the most frequent action, MaxFreq determinizes the states (1, 2), (1, 3), (2, 2) and (2, 3) to action c. Hence the red split (predicate y < 1.5) is considered optimal, as it groups together all four states that map to c. The blue split (predicate x < 1.5) is considered suboptimal, as then the data still looks very heterogeneous. So, using MaxFreq, we need two splits for this controller: one to split off all the c's and one to split the two remaining states.

However, it is better to first choose a predicate and then determine a fitting determinization function. When calculating the impurity of the blue split, we can choose to determinize all states with x = 1 to {a} and all states with x = 2 to {b}. Thus, in both resulting sub-controllers the impurity is 0 as all states agree on at least one action. This way, one split suffices to get a complete DT. Multi-label impurity measures notice when labels are shared between many (or all) states in a sub-controller, and thus they allow the algorithm to prefer the optimal blue split.
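The argument of Example 5 can be replayed in Python. Since the contents of Figure 5 are not reproduced in the text, the action sets below are an assumed reconstruction that is merely consistent with the description (c is the most frequent action, the states with x = 1 share a, the states with x = 2 share b). The sketch does not compute multi-label entropy itself; it only checks which splits leave children whose states agree on an action.

```python
from collections import Counter

# Assumed dataset consistent with Example 5; keys are (x, y) points.
controller = {
    (1, 1): {"a"}, (1, 2): {"a", "c"}, (1, 3): {"a", "c"},
    (2, 1): {"b"}, (2, 2): {"b", "c"}, (2, 3): {"b", "c"},
}


def max_freq(controller):
    """Per state, keep the allowed action that is most frequent overall."""
    freq = Counter(a for acts in controller.values() for a in acts)
    return {s: {max(sorted(acts), key=lambda a: freq[a])}
            for s, acts in controller.items()}


def split(controller, holds):
    yes = {s: acts for s, acts in controller.items() if holds(s)}
    no = {s: acts for s, acts in controller.items() if not holds(s)}
    return yes, no


def agree(part):
    """True if all states in the sub-controller share at least one action."""
    return bool(set.intersection(*part.values()))


def one_split_suffices(controller, holds):
    return all(agree(part) for part in split(controller, holds))


red = lambda s: s[1] < 1.5   # predicate y < 1.5
blue = lambda s: s[0] < 1.5  # predicate x < 1.5
fixed = max_freq(controller)
```

On the MaxFreq-determinized controller neither split yields two homogeneous children, so a second split is needed, whereas on the original permissive controller the blue split alone suffices.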

#### **7 Experiments**

Experimental setup. We compare three approaches: BDDs, the first version of dtControl from [4] and dtControl 2.0. For BDDs<sup>4</sup>, the variable ordering is important, so we report the smallest of 20 BDDs that we constructed by starting with a random initial variable ordering and reordering until convergence. To determinize BDDs, we used the pre-processing approach, 10 times with the minimum norm and 10 times with MaxFreq. For the previous version of dtControl, we picked the smaller of either a DT with only axis-aligned predicates or a DT with linear predicates using the logistic regression heuristic that was typically best in [4]. Determinization uses safe early stopping with the MaxFreq approach. For dtControl 2.0, we use the multi-label entropy based determinization and utilize the categorical predicates for the case studies from probabilistic model checking. We ran all experiments on a server with operating system Ubuntu 19.10, a 2.2 GHz Intel(R) Xeon(R) CPU E5-2630 v4 and 250 GB RAM.

Comparing determinization techniques on cyber-physical systems. Table 1 shows the sizes of determinized BDDs and DTs on the permissive controllers of the tools SCOTS and Uppaal Stratego that were already used in [4]. We see that the new determinization approach is strictly better than the previous one, with only two DTs being of equal size, as the result of the previous method was already optimal. With the exception of the case studies helicopter and truck trailer where BDDs are comparable or slightly better, both approaches using DTs are orders of magnitude smaller than BDDs or an explicit representation of the state-action mapping.

Case studies from probabilistic model checking. For Table 2, we used case studies from the quantitative verification benchmark set [24], which includes models from the PRISM benchmark suite [33]. Note that these case studies contain unordered enumeration-type state-variables for which we utilize the new categorical predicates. To get the controllers, we solved the case study with STORM and exported the resulting controller. This export already eliminates unreachable states. The previous version of dtControl was not able to handle these case studies, so we only compare dtControl 2.0 to BDDs.

<sup>4</sup> Our implementation of BDDs is based on the dd Python library https://github.com/tulip-control/dd.

Table 1: Controller sizes of different determinized representations of the controllers from SCOTS and Uppaal Stratego. "States" is the number of states in the controller, "BDD" the number of nodes of the smallest BDD from 20 tries, dtControl 1.0 [4] the smallest DT the previous version of dtControl could generate and dtControl 2.0 the smallest DT the new version can construct. "TO" denotes a failure to produce a result in 3 hours. The smallest numbers in each row are highlighted.

Table 2 shows that also for case studies from probabilistic model checking, DTs are a good way of representing controllers. The DT is the smallest representation on 13 out of 19 case studies, often reducing the size by an order of magnitude compared to BDDs or the explicit representation. On 3 case studies, BDDs are smallest, and on 2 case studies, both the DT and the BDD fail to reduce the size compared to the explicit representation. This happens if there are many different actions and thus states cannot be grouped together. A worst case example of this is a model where every state has a different action; then, a DT would have as many leaf nodes as there are states, and hence twice as many nodes in total.

Remark 2. Note that the controllers exported by STORM are deterministic, so no determinization approach can be utilized in the DT construction. We conjecture that if a permissive strategy were exported, dtControl 2.0 would benefit from the additional information and be able to reduce the controller size further, as it does for the cyber-physical systems.

#### **8 Conclusion**

We have presented a radically new version of the tool dtControl for representing controllers by decision trees. The tool now features a graphical user interface, allowing both experts and non-experts to conveniently interact with the decision tree learning process as well as the resulting tree. There is now a range of possibilities on how the user can provide additional information. The algebraic predicates provide the means to capture the (often non-linear) relationships from the domain knowledge. The categorical predicates together with the interface to probabilistic model checkers allow for efficient representation of strategies for Markov decision processes, too. Finally, the more efficient determinization yields very small (possibly non-performant) controllers, which are particularly useful for debugging the model.

We see at least two major promising future directions. Firstly, synthesis of predicates could be made more automatic using mathematical reasoning on the domain knowledge, such as substituting expressions with a certain unit of measurement into other domain equations in the places with the same unit of measurement, e.g. plugging the difference of two velocities into an equation for velocity. Secondly, one could transform the controllers into possibly entirely different controllers (not just less permissive) so that they still preserve optimality (or yield ε-optimality) but are smaller or simpler. Here, a closer interaction loop with the model checkers might lead to efficient heuristics.

### **References**


Informatics in Control, Automation and Robotics: Selected Papers from the International Conference on Informatics in Control, Automation and Robotics 2008, pp. 75–87. Springer Berlin Heidelberg, Berlin, Heidelberg (2009)



**Tool Demo Papers**

#### **HLola**: a Very Functional Tool for Extensible Stream Runtime Verification<sup>⋆</sup>

Felipe Gorostiaga<sup>1,2,3</sup>(✉) and César Sánchez<sup>1</sup>

<sup>1</sup> IMDEA Software Institute, Madrid, Spain
<sup>2</sup> Universidad Politécnica de Madrid, Madrid, Spain
<sup>3</sup> CIFASIS, Rosario, Argentina
{felipe.gorostiaga,cesar.sanchez}@imdea.org

Abstract. We present HLola, an extensible Stream Runtime Verification (SRV) tool, that borrows from the functional language Haskell (1) rich types for data in events and verdicts; and (2) functional features for parametrization, libraries, high-order specification transformations, etc.

SRV is a formal dynamic analysis technique that generalizes Runtime Verification (RV) algorithms from temporal logics like LTL to stream monitoring, allowing the computation of verdicts richer than Booleans (quantitative values and beyond). The keystone of SRV is the clean separation between temporal dependencies and data computations. However, in spite of this theoretical separation, previous engines include hardwired implementations of just a few datatypes, requiring complex changes in the tool chain to incorporate new datatypes. Additionally, when previous tools implement features like parametrization, these are implemented in an ad-hoc way. In contrast, HLola is implemented as a Haskell embedded DSL, borrowing datatypes and functional aspects from Haskell, resulting in an extensible engine<sup>4</sup>. We illustrate HLola through several examples, including a UAV monitoring infrastructure with predictive characteristics that has been validated in online runtime verification in real mission planning.

### 1 Introduction

Runtime Verification [4,14,18] is a dynamic technique that studies (1) how to generate monitors from formal specifications, and (2) algorithms to monitor the system under analysis, one trace at a time. Early RV specification languages were based on logics like past LTL [19] adapted to finite traces [5,10,15], regular expressions [23], fix-point logics [1], rule based languages [3], or rewriting [21]. Verdicts, and in many cases observations, in most of these specification logics are restricted to Booleans, often because most early logics in RV were borrowed from static verification, where decidability is crucial. SRV [9,22] attempts to generalize these monitoring algorithms to richer datatypes, including in observations and verdicts. SRV offers declarative specifications where offset expressions allow accessing streams at different moments in time, including future instants. Most previous SRV developments [9, 11] and their extensions to event-based


<sup>⋆</sup> This work was funded in part by the Madrid Regional Government under project "S2018/TCS-4339 (BLOQUES-CM)" and by Spanish National Project "BOSCO (PGC2018-102210-B-100)".

<sup>4</sup> The tool is available open-source at http://github.com/imdea-software/hlola

J. F. Groote and K. G. Larsen (Eds.): TACAS 2021, LNCS 12652, pp. 349–356, 2021. https://doi.org/10.1007/978-3-030-72013-1_18

systems [8,11,12,17] focus on efficiently implementing the temporal engine, promising that new datatypes can be incorporated easily. However, in practice, adding a datatype requires modifying the parser, the internal representation and the runtime system. Consequently, existing tools only support a limited hardwired collection of datatypes (typically Booleans and numeric types for quantitative monitoring).

In this paper we demonstrate the tool HLola, whose core language is Lola [9], but which enables arbitrary datatypes. HLola is implemented as an embedded DSL in Haskell. Other RV tools implemented as eDSLs include [2, 13] (in Scala), and [24] which implements LTL as an eDSL in Haskell. The main theoretical novelty of HLola is a technique called *lift deep embedding*, which consists in borrowing types transparently from Haskell and embedding the resulting language back into Haskell (see [7] for an introduction to HLola with details of the theoretical underpinnings). In fact, most HLola datatypes were introduced after the temporal engine was completed without requiring any re-implementation. An eDSL enables higher-order functions to describe transformations that produce stream declarations from stream declarations, enabling stream parametrization for free. HLola libraries collect these transformers so new logics like LTL, MTL, etc. with Boolean and quantitative semantics can be implemented in a few lines (see Section 2). Haskell type-classes enable *simplifiers*, which can anticipate the value of an expression without requiring the computation of all its sub-expressions. Implementing these in previous systems requires re-inventing and implementing features manually (like macro expansions, etc.). HLola even allows specifications as data to implement "specifications within specifications" (a feature that allows computing a full auxiliary specification at every instant, useful in simulation and for nested properties). This is used in a UAV scenario to implement Kalman filters [16] as monitors that predict the trajectory of the unmanned aircraft. The output of this monitor is used to anticipate problems (using another monitor) and take preventive planning actions.

**Stream Runtime Verification in a nutshell.** SRV generalizes monitoring algorithms to arbitrary data, where datatypes are abstracted using multi-sorted first-order interpreted signatures (called data theories in the Lola terminology). The signatures are interpreted in the sense that every functional symbol f used to build terms of a given type is accompanied by an evaluation function f (the interpretation) that allows the computation of values (given values of the arguments). A Lola specification ⟨I, O, E⟩ consists of (1) a set of typed input stream variables I, which correspond to the inputs observed by the monitor; (2) a set of typed output stream variables O, which represent the outputs of the monitor as well as intermediate observations; and (3) defining equations E, which associate every output y ∈ O with a stream expression E<sub>y</sub> that describes declaratively the intended values of y. The set of *stream expressions* of a given type is built from constants and function symbols as constructors (as usual), and also from *offset expressions* of the form s[k, d] where s is a stream variable, k is an integer number and d is a value of the type of s used as default. For example, altitude[-1,0.0m] represents the value of stream altitude in the previous step of time, with 0.0m as default value to be used at the initial instant. Online efficient algorithms can be synthesized for specifications with (bounded) future accesses [9, 22], where efficiency means that resources (time and space) are independent of the length of the trace and can be calculated statically. HLola can be efficiently monitored in a trace-length independent sense [7].
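The meaning of an offset expression s[k, d] over a finite trace can be illustrated with a short Python sketch. It is written in Python rather than in HLola syntax, evaluates a fully recorded trace offline, and the names `offset` and `climb` as well as the sample values are hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def offset(stream: Sequence[T], now: int, k: int, default: T) -> T:
    """Value of s[k, d] at instant `now`: s at now + k, or d outside the trace."""
    index = now + k
    return stream[index] if 0 <= index < len(stream) else default


altitude: List[float] = [10.0, 12.5, 11.0]

# Output stream defined by the equation climb = altitude - altitude[-1, 0.0].
climb = [altitude[now] - offset(altitude, now, -1, 0.0)
         for now in range(len(altitude))]
```

At the initial instant the default 0.0 is used for the previous altitude, so the first value of `climb` equals the first altitude; a positive offset that points past the end of the trace is resolved to the default in the same way.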

#### 2 The **HLola** Tool

Fig. 1 shows the software architecture of HLola. We start from an HLola specification, which can borrow datatypes, notation and features from the Haskell language (represented by the red dashed arrow in Fig. 1). A simple translator processes the specification and generates code in the Haskell eDSL. The translator does not fully parse the specification and only performs simple rewrites, leaving most of the specification unchanged. The resulting code is combined with the HLola engine (developed in Haskell) and compiled into a binary for the target platform. A well-known downside of this approach is that error reports produced during the second compilation stage may be rather cryptic. On the other hand, a Haskell expert can write specifications directly in the embedded DSL, which still resembles Lola, to finely tune an HLola specification.

The enhanced capabilities of HLola with respect to Lola (streams as data, stream type polymorphism and parametric streams) impact the syntax of the language, which diverges slightly from the syntax of the original Lola. HLola files can either be libraries or specifications: *Libraries* include HLola code that defines streams and facilities to create streams, and must be declared using **library <Name>** (where **<Name>** is the name of the library) on the first line of the HLola file. *Specifications* first state the format for input and output events as **format JSON** or **format CSV**. Source files can then import libraries and stream data manipulation facilities (called theories) with the statements **use library <Name>** and **use theory <Name>** respectively. HLola files can also import arbitrary Haskell libraries using the statement **use haskell <Name>**, and include Haskell code directly anywhere within blocks delimited by #HASKELL and #ENDOFHASKELL. Specifications then define the input and output streams. An *input stream* is declared by its type and name in a line of the form **input <Type> <name>**, just like in the original Lola language. The syntax of **<Type>** follows the Haskell notation. An *output stream* is specified by its type, name and parameters on the left hand side of **=**, and its defining expression on the right hand side of **=**:

**output <TypeConstraints>? <Type> <name>** <args>\* **=** <Expr>,

where **<TypeConstraints>** is an optional set of constraints over the polymorphic types handled by the stream (expressed in Haskell notation), and <args> is an optional list of arguments of the form **<Type> <name>**. We can use **define** instead of **output** to define intermediate streams, whose values are not reported by the monitor but can be used by other streams. The defining <Expr> of an output stream allows the use of let clauses, where blocks, type annotations, do notation, etc. The access to the *value* of a stream s at the current instant uses the term s**[now]** to distinguish it from s, the stream itself (whose type is *stream of values*). The offset expression that accesses a stream s at a shift of i with default value d is written as s**[**i**|**d**]**, as in classic Lola. The symbol **'** is used to lift an object o from the theory as in **'**o. We sometimes indicate the arity of the object o being lifted for clarity or to aid the type inference as in **2'**o. To improve readability, some operators have been overridden by their lifted version, such as if-then-else.

Fig. 1. Software Architecture of HLola.

*Libraries.* The following HLola file defines a library of Past-LTL operators, called **LTL**, as part of the HLola distribution<sup>5</sup>.

```
library LTL
use library Utils
output Bool historically <Stream Bool p> = p[now] && historically p [-1|'True]
output Bool once <Stream Bool p> = p[now] || once p[-1|'False]
output Bool since <Stream Bool p> <Stream Bool q> = q[now] ||
                                               (p[now] && p `since` q [-1|'False])
output Int nFalses <Stream Bool p> = nFalses p[-1|0] + if p[now] then 0 else 1
output Double percFalses <Stream Bool p> = nFalses p[now] `intdiv` (instantN[now])
```
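As a rough illustration, the recurrences of the library above can be mimicked over a finite trace. This is only a sketch in plain Python, not HLola code: the function names mirror the library streams, and the initial value of each accumulator plays the role of the default in the offset `[-1|d]`.

```python
from typing import List

def historically(p: List[bool]) -> List[bool]:
    out, prev = [], True              # default 'True before the first instant
    for v in p:
        prev = v and prev
        out.append(prev)
    return out

def once(p: List[bool]) -> List[bool]:
    out, prev = [], False             # default 'False before the first instant
    for v in p:
        prev = v or prev
        out.append(prev)
    return out

def since(p: List[bool], q: List[bool]) -> List[bool]:
    out, prev = [], False
    for pv, qv in zip(p, q):
        prev = qv or (pv and prev)
        out.append(prev)
    return out

def n_falses(p: List[bool]) -> List[int]:
    out, prev = [], 0                 # default 0 before the first instant
    for v in p:
        prev = prev + (0 if v else 1)
        out.append(prev)
    return out

p = [True, True, False, True]
print(historically(p))  # [True, True, False, False]
print(n_falses(p))      # [0, 0, 1, 1]
```

Each operator keeps a single value from the previous instant, which is why the memory needed does not grow with the length of the trace.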
The auxiliary library **Utils** includes instantN, which stores the current instant number. Stream **historically** is parametrized by **Bool**ean stream **p**. Once instantiated, historically p will be true until p becomes false for the first time, and will be false thereafter. This definition uses offsets to define the unrolling, using the constant value true in the first instant, lifted from Haskell as **'**True. This library also contains quantitative operators like **nFalses**, which counts the total number of falsifications up to an instant, and **percFalses**, which calculates the ratio of falsifications. A similar library for MTL includes the parametrized definition of $\varphi\,\mathcal{U}_{(a,b)}\,\psi$:

```
output Bool until <(Int,Int) (a,b)> <Stream Bool phi> <Stream Bool psi> = from a
  where from a | a == b = psi[a|'False]
              | otherwise = psi[a|'False] || (phi[a|'True] && from (a+1))
```
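The unrolling performed by `from` can be replayed offline as follows. This is a sketch under the assumptions that a ≤ b and that the whole trace is available as Python lists; the helper `at` is a hypothetical stand-in for the offset expression `s[i|d]`.

```python
from typing import List

def until(a: int, b: int, phi: List[bool], psi: List[bool], now: int) -> bool:
    """Bounded until over the interval (a, b), evaluated at instant `now`."""
    def at(stream: List[bool], k: int, default: bool) -> bool:
        pos = now + k
        return stream[pos] if 0 <= pos < len(stream) else default

    def frm(i: int) -> bool:
        if i == b:
            return at(psi, i, False)
        # psi holds at offset i, or phi holds there and the obligation moves on.
        return at(psi, i, False) or (at(phi, i, True) and frm(i + 1))

    return frm(a)

phi = [True, True, False, True]
psi = [False, False, True, False]
print(until(0, 2, phi, psi, 0))  # True: psi holds at offset 2, phi before it
print(until(0, 1, phi, psi, 0))  # False: psi does not hold within the interval
```

Because a and b are fixed when the stream is instantiated, the recursion unfolds into a finite expression with a bounded number of offsets.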
The parametrized stream **until** takes the interval (a, b) and the streams ϕ and ψ as parameters. Similarly, the library for Quantitative MTL introduces a parametrized stream to calculate the arithmetic mean of the last k values of a given stream:

```
output Double meanLast <Int k> <Stream Double str> = numr / denom
  where denom=1'fromIntegral (2'min 'k (instantN[now])) ; numr=sumLast k str [now]
```
which takes as parameters the window size **k** and the stream **str**. The denominator is the minimum of k and instantN, converted to **Double**. The numerator is the sum of the last k values in str. Polymorphism allows us to generalize this definition to any Haskell type as long as it is Fractional, Equalizable and Streamable, using the following stream signature instead (and the same expression):

**output (Eq a, Fractional a, Streamable a) => a meanLast <Int k> <Stream a str>**
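A small offline sketch may help to see what meanLast computes. It assumes that instantN counts instants starting from 1 (the text does not state the starting value) and recomputes each window from a stored list, which a real monitor would avoid. Using `Fraction` values loosely mirrors the polymorphic signature: any type with addition and division works.

```python
from fractions import Fraction
from typing import List

def mean_last(k: int, values: List) -> List:
    """Mean of the last k values at every instant (fewer at the start)."""
    out = []
    for t in range(len(values)):
        instant_n = t + 1                       # assumption: instants start at 1
        window = values[max(0, t - k + 1): t + 1]
        out.append(sum(window) / min(k, instant_n))
    return out

print(mean_last(2, [1.0, 3.0, 5.0]))            # [1.0, 2.0, 4.0]
print(mean_last(2, [Fraction(1), Fraction(2)])) # [Fraction(1, 1), Fraction(3, 2)]
```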

<sup>5</sup> All libraries, definitions and examples are available open-source in the GitHub repository and at https://software.imdea.org/hlola/specs.html.

### 3 Example Specifications

In this section we show a collection of HLola specifications to demonstrate the capabilities of HLola to define stream based monitors.

*Temporal Logics.* HLola allows us to easily define, in a declarative way, many specifications written in temporal logic. The HLola distribution contains many LTL examples, including a sender/receiver model from [6], and other temporal logics. Consider the following MTL property from [20]: $\Box(\mathit{alarm} \to (\Diamond_{[0,10]}\,\mathit{allClear} \lor \Diamond_{[10,10]}\,\mathit{shutdown}))$, which includes deadlines between environment events and the corresponding system responses, stating that an *alarm* is followed by a *shutdown* event in exactly 10 time units unless *allClear* is received. This is defined in HLola as follows:

```
format JSON
use library MTL
#HASKELL
data Event = Alarm | AllClear | ShutDown deriving (Generic,Read,FromJSON,Eq)
#ENDOFHASKELL
input Event event
define Bool allClear = event [now] === 'AllClear
define Bool shutdown = event [now] === 'ShutDown
define Bool alarm = event [now] === 'Alarm
output Bool property = alarm [now] `implies` (willClear[now] || willShutdown[now])
  where willClear = eventually (0,10) allClear
        willShutdown = eventually (10,10) shutdown
```
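To make the intended verdicts concrete, the property can be evaluated offline on a few short traces. This is only a sketch: it assumes that `eventually (a,b)` means "at some offset between a and b", that accesses beyond the end of the trace count as false, and it uses `None` as a hypothetical marker for instants without a relevant event.

```python
from typing import List, Optional

def eventually(a: int, b: int, p: List[bool], now: int) -> bool:
    """True if p holds at some instant in [now + a, now + b] inside the trace."""
    return any(p[now + i] for i in range(a, b + 1) if 0 <= now + i < len(p))

def property_at(events: List[Optional[str]], now: int) -> bool:
    all_clear = [e == "AllClear" for e in events]
    shutdown = [e == "ShutDown" for e in events]
    if events[now] != "Alarm":
        return True                       # the implication holds vacuously
    return eventually(0, 10, all_clear, now) or eventually(10, 10, shutdown, now)

handled = ["Alarm"] + [None] * 9 + ["ShutDown"]   # shutdown exactly 10 steps later
cleared = ["Alarm", None, "AllClear"]              # allClear within the deadline
ignored = ["Alarm"] + [None] * 10                  # neither response arrives
print([property_at(t, 0) for t in (handled, cleared, ignored)])  # [True, True, False]
```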
*Pinescript example.* TradingView is an online charting platform for stock exchange, which offers the Pinescript language to query stock time series. Pinescript queries are then run on the company's servers. We have implemented the indicators of Pinescript in HLola as a library, and we have implemented a trading strategy<sup>6</sup> using the HLola Pinescript library. Compared to Pinescript, HLola offers formal semantics and runtime resource guarantees (time and space), and is much more expressive, for example allowing relational queries that involve multiple stocks (their averages, etc.).

*UAV specifications.* We have also used HLola for the online monitoring of several properties of UAV missions. For example: (1) that the UAV does not fly over forbidden regions, and (2) that the UAV is in good position when it takes a picture. The input streams of these two specifications consist of the state of the UAV at every instant and the onboard camera events to detect when a picture is being captured. This specification imports geometric facilities from **theory Geometry2D**, and Haskell libraries **Data.Maybe** and **Data.List**. It then defines custom datatypes to retrieve data from the UAV, which are enclosed in a verbatim HASKELL block. The output stream **all\_ok\_capturing** assesses that, whenever the vehicle is taking a picture, the height, roll and pitch are acceptable and the vehicle is near the target location. The output stream **flying\_in\_safe\_zones** reports if the UAV is flying outside every forbidden region. The output stream **depth\_into\_poly** takes the minimum of the distances between the vehicle position and every side of the forbidden region inside which the vehicle is.

<sup>6</sup> Available at www.tradingview.com/script/DushajXt-MACD-Strategy

```
format JSON
use theory Geometry2D
use library Utils
use haskell Data.Maybe
use haskell Data.List
#HASKELL
data Attitude = Attitude {yaw :: Double, roll :: Double, pitch :: Double}
                        deriving (Show,Generic,Read,FromJSON,ToJSON)
data Target = Target {x :: Double, y :: Double, num_wp :: Double} ...
data Position = Position {x :: Double, y :: Double, alt :: Double} ...
#ENDOFHASKELL
input Attitude attitude
input Vector2 velocity
input Position position
input Double altitude
input Target target
input [[[Double]]] nofly
input [String] events_within
output Bool all_ok_capturing = capturing [now] `implies`
 (height_ok [now] && near [now] && roll_ok [now] && pitch_ok [now])
output Bool flying_in_safe_zones = 'isNothing (flying_in_poly [now])
output (Maybe Double) depth_into_poly = let
 mSides = '(fmap polygonSides) (flying_in_poly [now])
 distance_from_pos = 'shortestDist (filtered_pos [now])
 in 2'fmap distance_from_pos mSides
 where shortestDist x = minimum.map (distancePointSegment x)
define Bool capturing = ...
define Double filtered_pos_component <(Position->Double) field> <String nm> = ...
define Double filtered_pos_x = filtered_pos_component x "x" [now]
define Double filtered_pos_y = filtered_pos_component y "y" [now]
define Double filtered_pos_alt = filtered_pos_component alt "alt" [now]
define Point2 filtered_pos = 'P (filtered_pos_x [now]) (filtered_pos_y [now])
define Bool near = let target_pos = 'targetToPoint (target [now])
 in 2'distance (filtered_pos [now]) target_pos < 1
 where targetToPoint (Target x y _) = P x y
define Bool height_ok = filtered_pos_alt [now] > 0
define Bool roll_ok = '(abs.roll) (attitude [now]) < 0.0523
define Bool pitch_ok = '(abs.pitch) (attitude [now]) < 0.0523
define [Polygon] no_fly_polys = ...
define (Maybe Polygon) flying_in_poly = let
 position_in_poly = 'pointInPoly (filtered_pos [now])
 in 2'find position_in_poly (no_fly_polys [now])
```
Intermediate stream **capturing** captures whether the UAV is taking a picture (omitted for brevity). The streams **filtered\_pos\_alt** and **filtered\_pos** represent the altitude and location of the UAV, filtered to reduce noise from the sensors. We omit the definition of the filter, which is implemented in **filtered\_pos\_component**. The streams **height\_ok**, **roll\_ok**, and **pitch\_ok** check that the corresponding attitude of the vehicle is within certain boundaries. Finally, the intermediate stream **no\_fly\_polys** obtains a set of Polygons from the input forbidden regions (its definition has been omitted), and the stream **flying\_in\_poly** returns the forbidden region in which the vehicle is flying, if any. The artifact attached to this paper includes more UAV specifications, which have been validated in real missions [25].
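The geometric core of **flying\_in\_poly** and **depth\_into\_poly** can be sketched independently of HLola. The code below is an illustration in plain Python, not the Geometry2D theory: the ray-casting test and the function names are assumptions, and polygons are simply lists of (x, y) vertices.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]

def point_in_poly(pt: Point, poly: List[Point]) -> bool:
    """Ray casting: count edges crossed by a horizontal ray starting at pt."""
    x, y = pt
    inside = False
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        if (y1 > y) != (y2 > y):                      # edge spans the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def distance_point_segment(pt: Point, a: Point, b: Point) -> float:
    (px, py), (ax, ay), (bx, by) = pt, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    # Projection of pt onto the segment, clamped to its end points.
    t = 0.0 if length_sq == 0 else max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def depth_into_poly(pt: Point, no_fly_polys: List[List[Point]]) -> Optional[float]:
    """Distance to the nearest side of the forbidden region containing pt, if any."""
    poly = next((p for p in no_fly_polys if point_in_poly(pt, p)), None)
    if poly is None:
        return None                                   # flying in a safe zone
    return min(distance_point_segment(pt, a, b) for a, b in zip(poly, poly[1:] + poly[:1]))

square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
print(depth_into_poly((1.0, 2.0), [square]))  # 1.0
print(depth_into_poly((5.0, 5.0), [square]))  # None
```

Returning `None` outside every region plays the role of the `Maybe Double` type of the output stream.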

#### References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

#### **AMulet 2.0 for Verifying Multiplier Circuits**

Daniela Kaufmann and Armin Biere

Johannes Kepler University, Linz, Austria {daniela.kaufmann,biere}@jku.at

**Abstract.** AMulet 2.0 is a fully automatic tool for the verification of integer multipliers using computer algebra. Our tool models multiplier circuits given as and-inverter graphs as a set of polynomials and applies preprocessing techniques based on elimination theory of Gröbner bases. Finally it uses a polynomial reduction algorithm to verify the correctness of the given circuit. AMulet 2.0 is a re-factorization and improved reimplementation of our previous multiplier verification tool AMulet 1.0.

#### **1 Introduction**

Formal verification of arithmetic circuits is important to prevent issues like the famous Pentium FDIV bug [28]. Up to now there have been many attempts to verify these circuits, but even today the problem of fully automatic verification of arithmetic circuits, and especially multipliers, is still considered to be hard.

Methods based on decision diagrams [6] rely on manual structural decomposition of the multiplier. Approaches based on satisfiability checking (SAT) are not scalable [3]. Recently progress has been made using theorem provers [29]. However, the multipliers have to be given as SVL netlists, which relies on preservation of hierarchical information. For flattened gate-level multipliers the currently most successful technique uses algebraic reasoning [7, 15, 17, 25, 26]. In this line of work the circuit is modeled as a set of polynomials and the specification is then checked to be implied by the circuit polynomials. For non-experts, Chap. 2 of [15] might serve as an introduction to bit-level verification using computer algebra.

In our approach [17] we apply a combination of SAT solving and computer algebra. Certain parts of the multiplier, i.e., complex final stage adders that are generate-and-propagate (GP) adders [27], are hard to verify using computer algebra, but are easy to verify using SAT solvers [21]. Therefore we apply adder substitution [17] and replace complex final stage adders by simple ripple-carry (RC) adders. The equivalence of the adders is verified using SAT solvers. The correctness of the simplified multiplier is shown using computer algebra [17].

This tool paper presents AMulet 2.0, a successor of AMulet 1.0 [17,19]. AMulet 2.0 reads multipliers given as and-inverter graphs (AIG) [22] and fully automatically applies adder substitution and verifies the (simplified) circuit. Furthermore, certificates can be generated in the Nullstellensatz proof format [16] or in the practical algebraic calculus (PAC) [20] to validate the verification results.

This work is supported by the LIT AI Lab funded by the State of Upper Austria.

© The Author(s) 2021

J. F. Groote and K. G. Larsen (Eds.): TACAS 2021, LNCS 12652, pp. 357–364, 2021. https://doi.org/10.1007/978-3-030-72013-1\_19

AMulet 2.0 is a modular C++ re-implementation of AMulet 1.0 (which consists of a single C file). AMulet 2.0 is not only a standalone tool but also serves as a polynomial reasoning framework, i.e., parts can easily be integrated into different workflows, cf. Sect. 4. AMulet 2.0 still provides the same functionality as AMulet 1.0, but with improved algorithms, cf. Sect. 5, based on the same theory [15, 17]. In this paper we focus on novelties of AMulet 2.0 and refer the reader to [19] for an introduction to AMulet 1.0.

#### **2 Circuit Verification using Computer Algebra**

AMulet 2.0 takes as input signed or unsigned integer multipliers $C$, given as AIGs, with $2n$ input bits $a_0,\ldots,a_{n-1}, b_0,\ldots,b_{n-1} \in \{0, 1\}$ and output bits $s_0,\ldots,s_{2n-1} \in \{0, 1\}$. We denote the internal AIG nodes by $l_1,\ldots,l_k \in \{0, 1\}$. Let $\mathbb{Z}[X] = \mathbb{Z}[a_0,\ldots,a_{n-1}, b_0,\ldots,b_{n-1}, l_1,\ldots,l_k, s_0,\ldots,s_{2n-1}]$. The multiplier $C$ is correct iff for all possible inputs $a_i, b_i \in \{0, 1\}$ the specification $\mathcal{L} = 0$ holds:

$$\mathcal{L} = -\sum_{i=0}^{2n-1} 2^i s_i + \left(\sum_{i=0}^{n-1} 2^i a_i\right) \left(\sum_{i=0}^{n-1} 2^i b_i\right) \tag{1}$$

For signed multipliers the most significant bits $s_{2n-1}$, $a_{n-1}$, and $b_{n-1}$ determine the sign and the weights have to be negated, i.e., $2^{2n-1}$ becomes $-2^{2n-1}$.
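For very small bit-widths, specification (1) can be checked by brute force, which makes the meaning of $\mathcal{L} = 0$ concrete. The sketch below covers only the unsigned case; the gate-level 2-bit array multiplier is a hypothetical example circuit, not one of the benchmarks, and exhaustive enumeration is of course not how AMulet 2.0 works.

```python
from itertools import product

def spec(a_bits, b_bits, s_bits):
    """Value of L for unsigned operands, least significant bit first."""
    def word(bits):
        return sum(bit << i for i, bit in enumerate(bits))
    return -word(s_bits) + word(a_bits) * word(b_bits)

def two_bit_multiplier(a, b):
    (a0, a1), (b0, b1) = a, b
    p0, p1, p2, p3 = a0 & b0, a1 & b0, a0 & b1, a1 & b1   # partial products
    s0 = p0
    s1 = p1 ^ p2
    c1 = p1 & p2                                           # carry out of column 1
    s2 = p3 ^ c1
    s3 = p3 & c1
    return (s0, s1, s2, s3)

bit_pairs = list(product((0, 1), repeat=2))
correct = all(spec(a, b, two_bit_multiplier(a, b)) == 0
              for a, b in product(bit_pairs, repeat=2))
print(correct)  # True
```

The number of input assignments grows as $2^{2n}$, which is why an algebraic argument is needed for realistic bit-widths.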

The semantics of each AIG node implies a polynomial relation, e.g., $u = v \land \lnot w$ implies $-u + v - vw = 0$. Let $G(C) \subseteq \mathbb{Z}[X]$ be the set of polynomials that contains for each AIG node the corresponding polynomial relation. Additionally, all variables $x \in X$ are Boolean and we enforce this property by the set of Boolean value constraints $B(X) = \{x(1 - x) \mid x \in X\} \subseteq \mathbb{Z}[X]$. The polynomials in $G(C) \cup B(X)$ are ordered according to a lexicographic order, such that the output variable of a gate is always greater than the inputs of the gate [23].
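The two kinds of polynomials can be checked by direct evaluation. The following sketch (function names are illustrative) confirms that the gate polynomial for $u = v \land \lnot w$ vanishes on exactly the Boolean assignments that are consistent with the gate, and that $x(1 - x)$ vanishes only for $x \in \{0, 1\}$.

```python
from itertools import product

def gate_polynomial(u: int, v: int, w: int) -> int:
    """Polynomial relation for the AIG node u = v AND (NOT w)."""
    return -u + v - v * w

def boolean_constraint(x: int) -> int:
    """Boolean value constraint x(1 - x)."""
    return x * (1 - x)

# Boolean assignments on which the gate polynomial evaluates to zero.
consistent = [(u, v, w) for u, v, w in product((0, 1), repeat=3)
              if gate_polynomial(u, v, w) == 0]
print(consistent)  # [(0, 0, 0), (0, 0, 1), (0, 1, 1), (1, 1, 0)]

# Integers on which the Boolean value constraint evaluates to zero.
print([x for x in range(-2, 4) if boolean_constraint(x) == 0])  # [0, 1]
```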

Let $J(C) = \langle G(C) \cup B(X) \rangle \subseteq \mathbb{Z}[X]$ be the ideal generated by $G(C) \cup B(X)$. The circuit fulfills its specification if and only if we can derive that $\mathcal{L} \in J(C)$ [17]. We showed in [17] that $G(C) \cup B(X)$ is a D-Gröbner basis [2] for $J(C) \subseteq \mathbb{Z}[X]$. Thus, the correctness of the circuit can be established by reducing $\mathcal{L}$ by the polynomials $G(C) \cup B(X)$ and checking whether the result is zero.

However, simply reducing the specification by $G(C) \cup B(X)$ leads to large intermediate results [24]. Hence, we eliminate variables in $G(C) \cup B(X)$ prior to reduction to yield a more compact D-Gröbner basis [17], which boils down to simple substitutions, but relies on the elimination theorem of Gröbner bases [9].
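The reduction can be replayed on a tiny example. The sketch below is a simplification made for illustration and not AMulet's implementation: polynomials are dictionaries from terms (sets of variables) to integer coefficients, the set union in `mul` encodes the Boolean constraint $x \cdot x = x$, and reducing by a gate polynomial $-u + t$ is carried out as the substitution $u \mapsto t$. The hypothetical 2-bit multiplier is given at gate level (with XOR gates) rather than as an AIG.

```python
from collections import defaultdict
from itertools import product

def add(p, q):
    out = defaultdict(int, p)
    for term, c in q.items():
        out[term] += c
    return {t: c for t, c in out.items() if c != 0}

def mul(p, q):
    out = defaultdict(int)
    for (t1, c1), (t2, c2) in product(p.items(), q.items()):
        out[t1 | t2] += c1 * c2          # set union encodes x * x = x
    return {t: c for t, c in out.items() if c != 0}

def poly(*monomials):
    """Build a polynomial from (coefficient, 'x y z') pairs."""
    out = {}
    for c, names in monomials:
        out = add(out, {frozenset(names.split()): c})
    return out

def substitute(p, name, replacement):
    """Replace variable `name` in p by the polynomial `replacement`."""
    out = {}
    for term, c in p.items():
        piece = {term: c}
        if name in term:
            piece = mul({term - {name}: c}, replacement)
        out = add(out, piece)
    return out

def reduce_spec(spec, gates):
    for name, tail in gates:             # outputs first, inputs last
        spec = substitute(spec, name, tail)
    return spec

gates = [
    ("s3", poly((1, "p3 c1"))),
    ("s2", poly((1, "p3"), (1, "c1"), (-2, "p3 c1"))),   # s2 = p3 XOR c1
    ("s1", poly((1, "p1"), (1, "p2"), (-2, "p1 p2"))),   # s1 = p1 XOR p2
    ("s0", poly((1, "p0"))),
    ("c1", poly((1, "p1 p2"))),
    ("p3", poly((1, "a1 b1"))), ("p2", poly((1, "a0 b1"))),
    ("p1", poly((1, "a1 b0"))), ("p0", poly((1, "a0 b0"))),
]
spec = poly((-1, "s0"), (-2, "s1"), (-4, "s2"), (-8, "s3"),
            (1, "a0 b0"), (2, "a0 b1"), (2, "a1 b0"), (4, "a1 b1"))

print(reduce_spec(spec, gates))  # {} : the remainder is zero, the circuit is correct
```

Replacing the definition of `s3` by a wrong gate, for instance `poly((1, "p3"))`, leaves a non-zero remainder over the input variables only.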

#### **3 Usage**

AMulet 2.0 is available at http://fmv.jku.at/amulet2 and is published as open source under the MIT license. AMulet 2.0 relies on the AIGER library [5] and the GMP library [10]. The AIGER library is provided together with the source code of AMulet 2.0, whereas the GMP library needs to be pre-installed by the user. AMulet 2.0 is compiled by executing "./configure.sh && make".

In a complete workflow one should first apply adder substitution, using the substitution mode of AMulet 2.0, to make sure that a potential complex final stage adder is replaced by a simple RC adder. Afterwards, one of the two modes, the verification mode or certification mode, can be applied to verify the (simplified) multiplier, which we will call in the following rewritten multiplier. If it is known that the final stage adder is not a complex GP adder, the substitution step can be omitted. We present a complete demonstration for the unsigned 64-bit multiplier <bpwtcl.aig>, which is included in the complementary material [14]. The output of AMulet 2.0 can be seen in the corresponding log-files that are also included in the artifact.

**Adder Substitution.** First we apply adder substitution by running

./amulet -substitute bpwtcl.aig miter.cnf rewritten.aig [options]

If the multiplier computes multiplication of signed integers, the option "-signed" has to be provided, because the signedness is part of the circuit specification.

If adder substitution can be applied successfully, the generated miter is written to <miter.cnf> and the rewritten multiplier to <rewritten.aig>. Otherwise, the input multiplier will be written to <rewritten.aig> and a trivially unsatisfiable CNF is written to <miter.cnf>. The file <miter.cnf> has to be given to a SAT solver, e.g. Kissat [4], which is then expected to return unsatisfiable. The rewritten multiplier can be verified or certified using AMulet 2.0.

**Verification.** Verification is executed by

./amulet -verify rewritten.aig [options]

As for adder substitution, one has to invoke the option "-signed" for signed multipliers. Furthermore, the option "-no-counter-examples" is available, which turns off generation and saving of counter examples in <rewritten.cex>, in the case when the multiplier in <rewritten.aig> is incorrect.

**Certification.** Certification is applied using

./amulet -certify rewritten.aig out.pol out.prf out.spc [options]

In this mode, AMulet 2.0 verifies the multiplier and automatically generates proof certificates, which can be checked by corresponding proof checkers. AMulet 2.0 supports two proof formats, Nullstellensatz proofs [1,16] and PAC proofs [20] based on the polynomial calculus [8]. The default proof format is the Nullstellensatz proof, because it generates smaller proof files and is faster to check. Proofs in the PAC format can be generated using the option "-pac". All options of the verification mode are available too.

The proofs are stored in the provided files <out.pol>, <out.prf>, and <out.spc>. The file <out.pol> contains the gate constraints, the second file <out.prf> the core proof in the selected proof format, and the third file <out.spc> the specification of the multiplier. The generated proofs can be given to the proof checker Nuss-Checker [16] for Nullstellensatz proofs, or to the proof checkers Pacheck [20] or Pastèque [20] for PAC proofs.
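The complete workflow described in this section can be summarized as a list of command lines. The helper below is a hypothetical convenience sketch: it only assembles the commands and options quoted above, assumes the SAT solver binary is called `kissat`, and leaves running the commands and interpreting their output to the user.

```python
import shlex
from typing import List

def workflow(aig: str, signed: bool = False,
             certify: bool = False, pac: bool = False) -> List[List[str]]:
    """Commands for adder substitution, the SAT check, and verification."""
    opts = ["-signed"] if signed else []
    substitute = ["./amulet", "-substitute", aig, "miter.cnf", "rewritten.aig", *opts]
    sat_check = ["kissat", "miter.cnf"]        # expected answer: unsatisfiable
    if certify:
        final = ["./amulet", "-certify", "rewritten.aig",
                 "out.pol", "out.prf", "out.spc", *opts]
        if pac:
            final.append("-pac")               # PAC instead of Nullstellensatz
    else:
        final = ["./amulet", "-verify", "rewritten.aig", *opts]
    return [substitute, sat_check, final]

for command in workflow("bpwtcl.aig"):
    print(shlex.join(command))
```

For the unsigned multiplier bpwtcl.aig this prints the substitution command, the Kissat call on miter.cnf, and the verification command, in the order in which they should be run.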

**Fig. 1.** Architecture of AMulet 2.0.

### **4 AMulet 2.0**

In this section we present the architecture of AMulet 2.0 and discuss novel optimizations. The design of AMulet 2.0 is shown in Fig. 1. In contrast to AMulet 1.0, which consists of one single C file, AMulet 2.0 is split into components, which also allows integrating only parts, e.g., the polynomial library or the polynomial solver, into different workflows, cf. the provided demos in the artifact [14]. AMulet 2.0 is implemented in C++11 and consists of around 6 000 lines of code. It relies on the AIGER library [5] to process the given AIG and the GMP library [10] to represent large integers.

The mode of AMulet 2.0 is triggered by the command line input, cf. Sect. 3. In substitution mode, AMulet 2.0 parses the AIG, allocates the internal gate structure, and invokes the substitution engine for adder substitution. In verification mode, AMulet 2.0 reads the AIG and initializes the gate structure. Afterwards, the circuit is verified in the polynomial solver using polynomial operations of the polynomial library. In certification mode proofs are generated in addition. In the following we present the individual components of AMulet 2.0.

**Parser Module** AMulet 2.0 checks whether the given AIG circuit fulfills the requirements described in Sect. 2, i.e., the AIG circuit has an even number of inputs and an equal number of outputs. The AIG module wraps functions of the external AIGER library that are needed to process the input file.
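The input check can be illustrated on the header line of an AIGER file, which lists the maximum variable index and the numbers of inputs, latches, outputs and AND gates (`aag M I L O A` in the ASCII format, `aig M I L O A` in the binary format). The function below is a hypothetical sketch and does not use the AIGER library; it only checks the two conditions named above.

```python
def multiplier_bit_width(header: str) -> int:
    """Return n for a header describing 2n inputs and 2n outputs."""
    fmt, *fields = header.split()
    if fmt not in ("aag", "aig") or len(fields) < 5:
        raise ValueError("not an AIGER header")
    _max_var, inputs, _latches, outputs, _ands = map(int, fields[:5])
    if inputs % 2 != 0:
        raise ValueError("odd number of inputs")
    if outputs != inputs:
        raise ValueError("number of outputs differs from number of inputs")
    return inputs // 2

print(multiplier_bit_width("aag 7 4 0 4 3"))  # 2
```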

**Gate Library** After parsing we allocate a gate for each AIG node, which includes structural information, such as dependencies, or whether the gate represents an input/output or an XOR-gate. Furthermore, each gate is linked to a unique variable. If the given AIG is verified or certified, AMulet 2.0 also initializes the gate constraints and creates the specification polynomial L ∈ <sup>Z</sup>[X].

**Substitution Engine** In substitution mode, AMulet 2.0 applies heuristic pattern matching to identify GP adders [17]. In AMulet 2.0 we enhanced the identification heuristics and cover special cases that are not considered in AMulet 1.0. Thus, AMulet 2.0 is able to detect more GP adders than AMulet 1.0. After a positive GP pattern match, AMulet 2.0 generates an equivalent RC adder and replaces the GP adder by the RC adder. A bit-level miter is generated in CNF to verify the equivalence of the adders. The rewritten multiplier and the CNF miter are printed to the provided output files.

**Polynomial Solver** The polynomial solver is based on the solving engine of AMulet 1.0 [19] and is used to verify or certify the given multiplier. In a nutshell, the polynomial solver first applies preprocessing by eliminating selected variables. Afterwards, the remaining variables are ordered into column-wise slices, such that we can apply our incremental verification algorithm [18], where we split the specification L into multiple polynomials and verify the multiplier by deriving the correctness of each slice using polynomial reduction. The necessary polynomial operations are implemented in the **Polynomial Library**.

In AMulet 2.0 we eliminate variables before ordering them, while in AMulet 1.0 it is the other way around. We eliminate all internal gates of the XOR-structures and all single-parent nodes in the AIG. Thus, fewer variables are considered for ordering, which improves computation time of AMulet 2.0.

Furthermore, we include a novel XOR-based slicing approach in AMulet 2.0, which relies on the fact that many multiplier architectures use XOR-skeletons to compute the output bits. We identify these skeletons and assign all nodes of a skeleton to the same slice. Gates occurring between XOR-skeletons are assigned to the smaller (less significant) slice. Hence, after two iterations all slices are fixed, which improves slicing compared to AMulet 1.0. All variables that are not assigned to slices, e.g., gates used to compute the partial products in Booth encoding [27], are eliminated from the gate structure.

In the few cases where we cannot identify XOR-skeletons, e.g., in multipliers containing a carry-select adder, we fall back on the slicing approach of AMulet 1.0: We slice based on input cones and eagerly move gates between slices to reduce the number of carries, by iterating multiple times over the variables.

After assigning gates to slices, AMulet 2.0 reduces the slice-wise specifications incrementally by the sliced gate constraints and checks whether the final result is zero, following the implementation of AMulet 1.0. If the final remainder is not zero, AMulet 2.0 detects counter examples, i.e., input assignments for which the multiplier circuit computes an incorrect result.

In certification mode, AMulet 2.0 tracks polynomial operations in the selected proof format, i.e., Nullstellensatz or PAC, and prints gate constraints, the generated proof, and the specification L to the provided files.

**Polynomial Library** The polynomial library implements the arithmetic operations for addition and multiplication of polynomials (by constants), and division by terms. Since all variables represent Boolean values, we always reduce exponents greater than one automatically to one, i.e., we assume x · x = x.

Polynomials are represented as linked lists of monomials. Each monomial consists of a coefficient, represented using the GMP library, and a term. Terms are linked lists of variables, which are internally shared using a hash table.

In AMulet 1.0 we do not share monomials and make hard copies in the few occasions when a monomial needs to be copied. This has the benefit that we can simply modify coefficients of the monomials, e.g., during addition. In our experiments we observed that allocating new GMP objects is actually quite time consuming, and therefore we now share monomials in AMulet 2.0, using reference counting, which decreases verification time by a factor of two.

**Fig. 2.** Verification of AOKI multipliers (left) and of large multipliers (right), in seconds.

#### **5 Evaluation**

In our experiments we use an Intel Xeon E5-2620 v4 CPU at 2.10 GHz (with turbo mode disabled) with a memory limit of 128 GB. The time is listed in seconds (wall-clock time). We compare AMulet 2.0 to our previous tool AMulet 1.0 and to the most recent related work on multiplier verification using computer algebra where circuits are given as AIGs: RevSCA, RevSCA-2.0 [25] and the ABC-based work of [7]. The tool of [26] is not yet available. We consider two versions of AMulet 1.0: (i) AMulet 1.0 as published in [17], and (ii) AMulet 1.5, a slightly improved version [13] with new heuristics for detecting GP adders. The experimental data is included in the artifact [14].

In our first experiment we consider the comprehensive AOKI benchmark set [12], which provides 384 signed and unsigned integer multiplier architectures up to input bit-width 64, also covering Booth encoding. We consider all 384 architectures of bit-width 64. The time limit is set to 300 seconds. The results are shown on the left side of Fig. 2, where it can be seen that AMulet 2.0 is the only tool that is able to verify the complete benchmark set. RevSCA only supports verification of unsigned integers. The ABC-based work of [7] uses an optimization that only works for simple multiplier architectures. Enabling this optimization on the more involved AOKI benchmarks leads to incompleteness. Without enabling it, [7] either produces a segmentation fault or exceeds the time limit. Thus there are no results for [7] on the left side of Fig. 2.

In our second experiment we generate benchmarks of simple multipliers up to input size 2 048, using scripts by Arist Kojevnikov [11]. The time limit is set to 86 400 seconds (24 h) and the results are shown on the right side of Fig. 2. It can be seen that AMulet 2.0 outperforms all competitor tools and is an order of magnitude faster on large multiplier circuits.

#### **6 Conclusion**

We presented AMulet 2.0, a fully automatic tool for verifying multiplier circuits given as AIGs. AMulet 2.0 is a re-factorization and re-implementation of our previous verification tool AMulet 1.0 [17, 19] and successfully verifies a large set of multiplier architectures. In the future we want to directly integrate a SAT solver into AMulet 2.0 and provide language bindings, e.g. for Python.

#### **References**



#### **RTLola on Board: Testing Real Driving Emissions on your Phone**

Sebastian Biewer<sup>1</sup>(✉), Bernd Finkbeiner<sup>2</sup>, Holger Hermanns<sup>1,3</sup>, Maximilian A. Köhl<sup>1</sup>, Yannik Schnitzer<sup>1</sup>, and Maximilian Schwenger<sup>2</sup>

<sup>1</sup> Saarland University, Saarland Informatics Campus, Saarbrücken, Germany biewer@depend.uni-saarland.de

<sup>2</sup> CISPA Helmholtz Center for Information Security, Saarbrücken, Germany <sup>3</sup> Institute of Intelligent Software, Guangzhou, China

**Abstract.** This paper is about shipping runtime verification to the masses. It presents the crucial technology enabling everyday car owners to monitor the behaviour of their cars in-the-wild. Concretely, we present an Android app that deploys RTLola runtime monitors for the purpose of diagnosing automotive exhaust emissions. For this, it harvests the availability of cheap Bluetooth adapters to the On-Board-Diagnostics (OBD) ports, which are ubiquitous in cars nowadays. We detail its use in the context of Real Driving Emissions (RDE) tests and report on sample runs that helped identify violations of the regulatory framework currently valid in the European Union.

#### **1 Introduction**

In the last decade, far more than 600 million cars have entered the streets worldwide [10]. With very few exceptions, each of these cars is equipped with a standardized On-Board-Diagnostics (OBD [16]) interface. Five years ago it surfaced that many of the cars out there do not adhere to the regulatory framework with which they are supposed to comply. For example, a number of undeniable proofs of tampered emission cleaning systems in passenger cars [5,3,14] are known by now. When this scandal first surfaced, the regulations imposed by the authorities were related to isolated tests carried out under lab-like conditions on chassis dynamometers [20,4]. Since then, there has been a growing understanding that emission and fuel or battery consumption measurements should best take place in a realistic context. Hence, the first test framework for testing on public roads, the Real Driving Emissions (RDE) test, has been developed [19,17] and is being rolled out for car model approval in Europe and other jurisdictions.

The RDE regulation specifies the conditions under which a car trip qualifies as a valid RDE test. These conditions refer to the trajectory driven, duration, altitudes, speeds, and the dynamics of the driving profile [17]. By combining the information available at the OBD port and the position of the car, it is possible to cast RDE testing into a runtime monitoring [21,13,12] problem. Indeed, we have shown in earlier work [9] how to formalize the RDE regulations in RTLola [7,8], a real-time extension of the stream-based specification language Lola [6]. Lola combines the ease-of-use of rule-based specification languages with the expressive power of heavy-weight scripting languages or temporal logics. The eponymous framework generates runtime monitors for such specifications, which were successfully deployed, for instance, on unmanned aircraft [18,2].

<sup>⋆</sup> This work is partly supported by DFG grant 389792660 as part of TRR 248 – CPEC, by the European Research Council (ERC) grants 683300 (OSARES), 695614 (POWVER), and 966770 (LEOpowver), and by the Key-Area Research and Development Program Grant 2018B010107004 of Guangdong Province.

© The Author(s) 2021

J. F. Groote and K. G. Larsen (Eds.): TACAS 2021, LNCS 12652, pp. 365–372, 2021. https://doi.org/10.1007/978-3-030-72013-1\_20

An official RDE test requires a calibrated portable emissions measurement system (PEMS) to be connected to the car's exhaust pipe while driving the test, so as to correctly quantify the amount of exhaust emissions induced. The purchasing costs of a PEMS are in the order of 250,000 €, which is close to unaffordable even in a research context. However, many car models expose a variety of diagnosis data through OBD, and an OBD-to-Bluetooth adapter can be purchased for around 10 €. The data exposed depends on the type of engine, emission cleaning system, and other components in use. There are several minimal combinations of OBD data giving good approximations of emitted gases. In particular, various car models expose the sensor readings of their after-treatment NO<sub>x</sub> sensor deployed at the rear of the exhaust pipe.

Contribution. This paper presents LolaDrives, an Android app enabling car owners to carry out real driving emission tests with little investment. Prerequisites are (i) an Android phone, (ii) an OBD-to-Bluetooth adapter, and (iii) a car model that does indeed expose the needed values via OBD. If the latter is not the case, the app can still serve the user as a convenient personal monitoring and logging device for the many quantities exposed while driving.

A structural overview of LolaDrives is depicted in Figure 1. At the core of the app is an Android version of the RTLola engine [7]. The engine is strictly separated from the data acquisition and the RTLola RDE specification. This separation will make it possible to reuse the approach in other runtime monitoring contexts, be it of espresso machines via USB, or drones via Wi-Fi. In both cases, it would primarily be the specification in RTLola that needs to change, not the engine. Car sensor data is acquired via Bluetooth from the OBD device, and combined with location data provided by Android's GPS service. The data streams are recorded for later diagnosis. Anticipating future application scenarios involving crowd sourcing car data, we advertise the app as part of a car data platform (CDP), which includes an upload facility for donating drive records. While driving, the app's user interface (UI) displays diagnostic information to the user, both regarding the correct execution of an RDE test drive and the car's emission data. We will detail the separate components of the app next.

Fig. 1: LolaDrives (components shown: Sensor Data, Drive Record, RTLola, RDE Specification, Data Donation, UI)

Notably, the lack of any calibration and the unknown precision of the data exposed by the car manufacturer via OBD make it impossible to consider the RDE test results reported by LolaDrives as anything more than indicators of the car's RDE behaviour in a legal sense.

#### **2 RDE Monitoring on Android**

The primary feature of LolaDrives is to monitor the progress of an RDE test drive. For this, it uses the RTLola monitoring framework. This bridges the gap between formally sound concepts and everyday use cases. While RTLola does target a broad audience, that audience is still intended to be expert users rather than the general public. It requires users to execute three tasks: provide a formal specification of the intended behaviour, supply input data, and interpret the monitor's output. LolaDrives reduces these tasks to minimal action points for end-users.

Specification. No end-user input is required with respect to the RTLola specification. The definition of what is a valid RDE test is fixed [9] and strictly follows the constraints imposed by the regulation issued by the European Commission [17]. These constraints concern the driving behaviour and layout of the route. Some of them apply universally, e.g., the ambient temperature must range between 273 K and 303 K. For others, the RDE regulation differentiates three environments: urban, rural, and motorway, with different environments imposing different restrictions on the car, such as an average velocity between 15 and 40 km/h in an urban environment. A segment refers to all parts of the test drive in which the car operates in a certain environment. While segments may be interrupted, each one needs to occupy a specific share of the total distance travelled.

Input Data Provision. LolaDrives uses sensor readings provided over the OBD interface as input data. The user only has to plug the OBD-to-Bluetooth adapter into the respective port at (or close to) the dashboard of her car and pair it with her phone. The car then automatically transmits data to the phone while driving.

Interpretation of Output. While driving, LolaDrives assists the user in the critical task of satisfying all the constraints that make up a valid RDE test. It provides feedback on the driving behaviour indicating which requirements on the test are satisfied to what extent, and which still need attention. Furthermore, it evaluates the measured exhaust data and informs the user of whether or not the car violates emission regulations. Both of these tasks require an online analysis of driving data. For this analysis, LolaDrives uses the RTLola monitoring framework.

Foundational Underpinning. RTLola [8,7] is a stream-based runtime verification framework. The RTLola monitor analyses sequences of input data to assess whether or not the system complies with the specification. The specification language has a formal semantics which enables devising provably correct monitoring algorithms [15].

An RTLola specification consists of input stream declarations where each input stream corresponds to a source of input data such as the NO<sub>x</sub> sensor of the car. Output stream declarations then spell out how to filter and refine the input data. For this, RTLola provides primitives for complex analyses such as sliding window aggregation for common aggregation functions. Further, the specification contains binary trigger conditions. The satisfaction of such a condition constitutes a violation of the specification and prompts the monitor to immediately relay a warning to the user. The following snippet is an extract of an RTLola specification for RDE test drives [11]:

```
input velo_kmph, accel_mpss: Float64
output is_rural := ... output rural_avg_velo := ...
output rural_dyn : Float64 @1Hz filter: is_rural := velo_kmph * accel_mpss / 3.6
output rural_pctl_dyn : Float64 @1Hz :=
   rural_dyn.aggregate(over: 7200, using: pctl(95)).defaults(to: 0.0)
trigger rural_pctl_dyn > (0.136 * rural_avg_velo + 14.44)
   ∧ rural_avg_velo <= 74.6
```
This specification fragment checks whether the car complies with the RDE regulations regarding the driving dynamics in the rural segment<sup>4</sup>. The first line declares two input streams representing the velocity in km h<sup>−1</sup> and acceleration in m s<sup>−2</sup> supplied by the car. The third line computes the dynamics in m<sup>2</sup> s<sup>−3</sup> by multiplying the velocity and acceleration; the division by 3.6 converts the velocity from km h<sup>−1</sup> to m s<sup>−1</sup>. The regulations then demand that the 95th percentile of the dynamics is no greater than 0.136 · v<sub>avg</sub> + 14.44, where v<sub>avg</sub> is the average velocity of the vehicle. The computation of the velocity and the dynamics only considers sensor readings obtained while in the rural segment. The full specifications are publicly available [1]. Note that while the specification is relatively easy to design and understand for computer scientists and engineers, it exceeds the expertise that can be expected of lay users. However, it is not necessary for them to be confronted with the full potential of the language because LolaDrives comes preconfigured with a set of RDE-specific specifications.

As can be seen, the requirements on the end-user are minimal. Thus, the setup enables users to conduct RDE test drives and assess the emission behaviour of their cars without requiring them to understand the underlying technology.
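To make the arithmetic of the rural-dynamics trigger concrete, the following Python sketch mirrors the condition of the excerpt outside of RTLola. It is an illustration under stated assumptions, not the monitor generated by the framework: the `is_rural` flag is taken as given (its definition is elided in the excerpt), `rural_avg_velo` is assumed to be the arithmetic mean of the rural velocity samples, the percentile uses the nearest-rank method, and the sliding window of the original is ignored. All function names are hypothetical.

```python
import math


def percentile_nearest_rank(values, q):
    """q-th percentile (0 < q <= 100) using the nearest-rank method (an assumption)."""
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def rural_dynamics_violated(samples):
    """samples: iterable of (velo_kmph, accel_mpss, is_rural) tuples, one per second.

    Returns True if the trigger condition of the excerpt holds.
    """
    rural = [(v, a) for v, a, is_rural in samples if is_rural]
    if not rural:
        return False  # no rural data yet, nothing to report
    # Dynamics in m^2/s^3: velocity converted from km/h to m/s, times acceleration.
    dyn = [v * a / 3.6 for v, a in rural]
    # Assumption: the elided rural_avg_velo is the mean rural velocity in km/h.
    avg_velo = sum(v for v, _ in rural) / len(rural)
    pctl = percentile_nearest_rank(dyn, 95)
    return pctl > 0.136 * avg_velo + 14.44 and avg_velo <= 74.6
```

For instance, at a constant 72 km/h the bound is 0.136 · 72 + 14.44 = 24.232, so a steady acceleration of 0.5 m s<sup>−2</sup> (dynamics of about 10) stays below it, whereas 2 m s<sup>−2</sup> (dynamics of about 40) raises the trigger.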

### **3 User Experience**

This section discusses the user perspective on LolaDrives. After a general overview, we report on the use of LolaDrives for conducting RDE test drives with a rented vehicle (the precise car model being unknown upfront).

Overview. The preparation of the test requires the user to plug the OBD adapter into the OBD port of the car. After starting car and app, LolaDrives receives data packets and determines the sensor profile of the car, assuming phone and adapter are paired via Bluetooth. Some sensor profiles provide insufficient data to conduct an RDE test drive. In this case, the app is still convenient to use for real-time displaying and logging the available data regardless of RDE regulations, see Figure 2a. If the data suffices, the app selects an appropriate specification and initializes the RTLola monitor. LolaDrives then starts filtering and visualising the data output and trigger notifications provided by the monitor.

<sup>4</sup> See Annex IIIA, Appendix 7a, 3.1.3 in the EU regulations [17].

(a) Diagnostics view displays the most recent diagnostics information.

(b) RDE progress view displays current state parameters of the test drive.

(d) Map of the second RDE test drive.

Fig. 2: UI of LolaDrives displaying different views and a map of a test route.

After successful setup, the UI switches to an RDE guiding view (Figure 2b). From top to bottom, it shows the total time, which must be between 90 and 120 min to finish the test, and the total distance travelled. The next line indicates the current state of the conditions for a valid RDE test drive disregarding emission data. In the screenshot, the drive is still in progress and inconclusive, indicated by the question mark. Alternatively, the UI can indicate success or failure. The latter verdict can occur far before the time limit is reached, caused by an irrecoverable situation such as transgression of the 160 km h<sup>−1</sup> speed limit. Note that the indicator reports the status the test drive would have if it were to end at this moment. Together with the regulatory constraints, this implies that the current verdict can alternate between success and failure from minute 90 to 120. As there is no specific point in time when the test ends, the app continues to compute statistics until the tester manually stops it or the 120 min mark is reached. Beneath the status indicator is the green NO<sub>x</sub> bar displaying the total NO<sub>x</sub> emissions. The vertical red bar denotes the permitted threshold of 168 mg km<sup>−1</sup>.
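The three-valued status indicator described above can be sketched as a small decision function. This is a simplified illustration, not the logic of the app: only the duration window and the speed limit quoted in the text are modelled explicitly, all remaining validity conditions are collapsed into one hypothetical flag `other_constraints_ok`, and treating a drive longer than 120 min as a failure is an assumption.

```python
from enum import Enum


class Verdict(Enum):
    INCONCLUSIVE = "?"
    SUCCESS = "valid"
    FAILURE = "invalid"


MIN_DURATION_MIN = 90.0    # lower bound on total test time (from the text)
MAX_DURATION_MIN = 120.0   # upper bound on total test time (from the text)
MAX_SPEED_KMPH = 160.0     # exceeding this is irrecoverable (from the text)


def rde_status(elapsed_min, top_speed_kmph, other_constraints_ok):
    """Status the test drive would have if it ended right now (simplified)."""
    if top_speed_kmph > MAX_SPEED_KMPH or elapsed_min > MAX_DURATION_MIN:
        return Verdict.FAILURE           # cannot be repaired by driving on
    if elapsed_min < MIN_DURATION_MIN:
        return Verdict.INCONCLUSIVE      # too early for a positive verdict
    # Between minute 90 and 120 the verdict may flip as the statistics evolve.
    return Verdict.SUCCESS if other_constraints_ok else Verdict.FAILURE
```

Under these assumptions, a drive at minute 30 is reported as inconclusive unless the speed limit was already exceeded, which matches the early-failure behaviour described in the text.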

The next three UI groups represent the progress in each of the distinct segments: urban, rural, and motorway. Each group consists of two horizontal bars. The gray progress bar displays the distance covered in the respective segment. The vertical blue indicators denote lower and upper bounds as per official regulation, for an expected trip length of 83 km. The blue bar below the gray one illustrates two different metrics for the driving dynamics. Both dots need to remain below/above their thresholds. A more aggressive acceleration behaviour shifts the dots to the right and a passive driving style to the left.

Table 1: Aggregation of the emission data based on the CDP.

Test Drive. The technical framework and visual feedback of the app were tested in two RDE test drives. Both tests were conducted with an Audi A6 Avant 45-TDI hybrid diesel, which is approved as Euro 6d-TEMP (DG) with an NO<sub>x</sub> threshold of 80 mg km<sup>−1</sup> under lab conditions and 168 mg km<sup>−1</sup> for RDE conditions. Among the diagnosis parameters available within this car are vehicle and engine speed, ambient temperature, engine fuel rate, mass air flow, and two NO<sub>x</sub> sensors, one in front of and one behind the emission cleaning system in the exhaust pipe. With this data, exhaust mass flow and fuel consumption can be computed, from which the total amounts of NO<sub>x</sub> and CO<sub>2</sub> can be derived [11]. In both drives, the driving dynamics were close to the allowed maximum, in the first test below and in the second test above the threshold, so the second test drive did not result in a valid RDE test. In both cases, the app correctly confirmed the satisfaction and violation, respectively, of the RDE criteria. In the first drive, the app reported an average NO<sub>x</sub> emission of 214 mg km<sup>−1</sup>. This exceeds the RDE threshold of 168 mg km<sup>−1</sup> and thus constitutes a violation of the regulation.
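The final comparison against the threshold is a distance-specific average. A minimal Python sketch of that last step is shown below; it assumes the total NO<sub>x</sub> mass has already been derived from the sensor data as described in [11], and the function names as well as the numbers in the usage note are hypothetical.

```python
RDE_NOX_LIMIT_MG_PER_KM = 168.0  # RDE threshold quoted in the text


def nox_per_km(total_nox_mg, distance_km):
    """Distance-specific NOx emission in mg/km."""
    if distance_km <= 0:
        raise ValueError("distance must be positive")
    return total_nox_mg / distance_km


def exceeds_rde_limit(total_nox_mg, distance_km, limit=RDE_NOX_LIMIT_MG_PER_KM):
    """True if the average emission over the trip is above the given limit."""
    return nox_per_km(total_nox_mg, distance_km) > limit
```

As an illustration with made-up totals, 17,762 mg of NO<sub>x</sub> over an 83 km trip corresponds to 214 mg km<sup>−1</sup>, which is above the limit.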

The app also allows for inspection of the driving data in a plotted form (Figure 2c). Figure 2d shows the route of an RDE test drive. The first half of the time constituted the urban segment (green). The next 30-40% of the test mainly consisted of the rural segment (purple) followed by the motorway segment (red).

Data Harvesting. For further analysis, data can be uploaded to a cloud storage which is part of the car data platform (CDP). This platform provides a uniform way to harvest data by specifying a format for collection, analysis, and exchange of this data. CDP builds upon a JSON format (https://json-schema.org/) containing timestamped events such as an OBD response, including its raw payload. As an example, the data presented in Table 1 is an aggregation of the RDE test drives mentioned above obtained by post-processing the data.

### **4 Conclusion**

LolaDrives pushes runtime verification technology into cars and phones of everyday users. The app is available in Google Play [1]; development of an iOS version has already started. Moreover, the car data platform constitutes a crowd-sourcing initiative for car data with the intention to enable large scale analyses of emission data beyond a single trip and a single car model.

#### **References**


volkswagen-diesel-cost-30-billion/index.html, Online; accessed: 2020- 10-15


**Legal Attribution** Android, Google Play and the Google Play logo are trademarks of Google LLC.


#### Replicating Restart with Prolonged Retrials: An Experimental Report<sup>⋆</sup>

Carlos E. Budde (✉) and Arnd Hartmanns (✉)

University of Twente, Enschede, The Netherlands {c.e.budde,a.hartmanns}@utwente.nl

**Abstract.** Statistical model checking uses Monte Carlo simulation to analyse stochastic formal models. It avoids state space explosion, but requires rare event simulation techniques to efficiently estimate very low probabilities. One such technique is Restart. Villén-Altamirano recently showed, by way of a theoretical study and ad-hoc implementation, that a generalisation of Restart to *prolonged retrials* offers improved performance. In this paper, we demonstrate our independent replication of the original experimental results. We implemented Restart with prolonged retrials in the FIG and modes tools, and applied them to the models used originally. To do so, we had to resolve ambiguities in the original work, and refine our setup multiple times. We ultimately confirm the previous results, but our experience also highlights the need for precise documentation of experiments to enable replicability in computer science.

#### 1 Introduction

In stochastic timed systems, the time between faults, customer interarrival times, transmission delays, or exponential backoff wait times follow (continuous) probability distributions. Probabilistic model checking [3] can compute dependability metrics like reliability and availability in the Markovian case. To evade state space explosion and evaluate non-Markovian systems, statistical model checking (SMC [2]) has become a popular alternative. At its core, SMC is Monte Carlo simulation for formal models. It faces a runtime explosion when estimating the probability $p$ of a *rare event* with a sufficiently low error, e.g. an error of $\pm 10^{-10}$ for $p \approx 10^{-9}$ (i.e. a *relative* error of $0.1$). *Rare event simulation* (RES) techniques [17] address this problem. They can broadly be categorised into *importance sampling* and *importance splitting*. The former changes the probability distributions while the latter changes the simulation algorithm to make the rare event more likely. Both techniques then compensate for these changes in the statistical evaluation. RES has garnered the interest of mathematicians and computer scientists alike. The scientific outcomes range from theoretical studies of a RES technique's limit behaviour and optimality [8,14,16] over experimental validation on Matlab studies or ad-hoc implementations [10,11,19] to application reports using larger case studies [5,12,18] as well as automated tools [4,6,15,18] that accept a loss of optimality in exchange for practicality.

<sup>⋆</sup> Authors are listed alphabetically. This work was supported by NWO via project no. 15474 (SEQUOIA) and VENI grant no. 639.021.754.

© The Author(s) 2021

J. F. Groote and K. G. Larsen (Eds.): TACAS 2021, LNCS 12652, pp. 373–380, 2021. https://doi.org/10.1007/978-3-030-72013-1\_21
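The runtime explosion mentioned above can be quantified with a standard back-of-the-envelope calculation. For crude Monte Carlo with independent Bernoulli samples, the normal approximation gives a confidence interval half-width of $z \sqrt{p(1-p)/n}$, so a relative error $\varepsilon$ needs about $n \ge z^2 (1-p) / (p\,\varepsilon^2)$ samples. The Python sketch below evaluates this bound; the 95% confidence level and the function name are assumptions made for illustration and are not taken from the original text.

```python
import math
from statistics import NormalDist


def required_samples(p, rel_err, confidence=0.95):
    """Approximate number of crude Monte Carlo samples needed to estimate a
    probability p with the given relative error (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # two-sided quantile
    return math.ceil(z * z * (1.0 - p) / (p * rel_err ** 2))


if __name__ == "__main__":
    # The example from the text: p around 1e-9 with a relative error of 0.1.
    print(required_samples(p=1e-9, rel_err=0.1))
```

For the example in the text this is on the order of $10^{11}$ simulation runs, which illustrates why rare event simulation techniques are needed.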

Two recent papers showed theoretically [21] and empirically [19] that *prolonging retrials* in the Restart importance splitting technique [22] reduces the required number of samples for the same error, with optimal runtime around prolonging by 1 to 2 levels. The models and parameters used in [19] are described in supplementary material [20], but the implementation is not publicly available. In this paper, we demonstrate our *replication* of the results of [19,21], where replication "means that an independent group can obtain the same result using artifacts which they develop completely independently" in the ACM terminology [1]. To this end, we implemented Restart with prolonged retrials (Restart-P) in the FIG rare event simulator [4] and the modes statistical model checker [7] of the Modest Toolset [13]. We recreated the models in the IOSA and Modest languages, and ran experiments following the original setup.

Our experiments confirm the behaviour and performance improvements of Restart-P reported in [19,21]. However, we encountered ambiguities in the textual and pictorial descriptions of Restart-P and the experimental setup in the original papers, some of which we could only resolve with input from the author of [19,21]. Different parts of our work thus reside on different levels between replication and *reproduction* (which "means that an independent group can obtain the same result using the author's own artifacts" [1]). Throughout the paper, we document where we achieved fully independent replication, where information from private communication was needed, and where we had to ultimately resort to requesting and inspecting the source code for the original implementation.

The contribution of this paper is thus threefold: (1) We provide pseudocode for Restart-P in Sect. 2 that clarifies the technical details w.r.t. [19,21]. (2) We demonstrate the new Restart-P capabilities of FIG and modes by replicating the original experiments in Sect. 3. (3) We reflect on our experience (as practical computer scientists) in independently replicating existing (theoretically-flavoured) work.

### 2 Restart with Prolonged Retrials

Let a stochastic timed discrete-event model be given as a tuple $\langle S, s_0, \mathit{step}, F \rangle$ of a set of states $S$, an initial state $s_0 \in S$, a function $\mathit{step} : S \to [0, \infty) \times S$ where $\mathit{step}(s)$ samples a random path from $s$ to the next event and returns a pair $\langle t, s' \rangle$ of its duration and next state, and a subset of rare event states $F \subseteq S$. A simulation *run* is a sequence of states obtained by repeatedly applying $\mathit{step}$. Models with general probability distributions encode their memory in the states.

Importance splitting uses an *importance function* $f_I : S \to [0, \infty)$ indicating "how close" a state is to the rare event. Partition the range of $f_I$ into $k + 1$ nonempty intervals to obtain a *level function* $f_L : S \to \{0, \ldots, k\}$ with $f_L(s_1) < f_L(s_2) \Rightarrow f_I(s_1) < f_I(s_2)$. For simplicity, assume $f_I(s_0) = 0$ and $\mathit{step}(s) = \langle t, s' \rangle \Rightarrow f_L(s') \le f_L(s) + 1$ (a step moves up by at most one level). Let $C_i \stackrel{\text{def}}{=} \{ s \mid f_L(s) \ge i \}$. Then "thresholds $T_i$ of $f_I$ are defined so that each set $C_i$ is associated with $f_I \ge T_i$" [21]. Function $f_S : \{1, \ldots, k\} \to \mathbb{N} \setminus \{0\}$ defines *splitting factors*. $f_I$, $f_L$, and $f_S$ are specified by experts or derived automatically [6].

$$
\begin{array}{l}
\textbf{Input: } \text{model } \langle S, s_0, \mathit{step}, F \rangle,\ f_L,\ f_S,\ \text{prolongation depth } j,\ \text{max. sim. time } T_{max} \\
t_F := 0,\ \text{list } \xi := \{\!|\, \langle s_0, 0, 0, 0 \rangle \,|\!\} \quad \textit{// state, time, creation level, last-split level} \\
\textbf{while } \xi \neq \emptyset \textbf{ do} \quad \textit{// run all trials to end} \\
\quad \langle s, t, \ell_{create}, \ell_{split} \rangle := \xi.\text{get-remove}() \quad \textit{// data of current trial} \\
\quad \textbf{while } t < T_{max} \textbf{ do} \\
\qquad \langle t', s' \rangle := \mathit{step}(s) \quad \textit{// simulate to next change in state} \\
\qquad t' := \min\{ t', T_{max} - t \},\ t := t + t' \quad \textit{// advance time, at most to } T_{max} \\
\qquad \textbf{if } s \in F \textbf{ then } t_F := t_F + t' / \prod_{i=1}^{\ell_{split}} f_S(i) \quad \textit{// accumulate weighted rare time} \\
\qquad \langle \ell, \ell' \rangle := \langle f_L(s), f_L(s') \rangle,\ s := s' \\
\qquad \textbf{if } \ell' < \ell \textbf{ then} \quad \textit{// trial went down:} \\
\qquad\quad \textbf{if } \ell' = 0 = \ell_{create} \textbf{ then } \ell_{split} := 0 \quad \textit{// reset main trial at level 0} \\
\qquad\quad \textbf{else if } \ell' = 0 \lor \ell' < \ell_{create} - j \textbf{ then break} \quad \textit{// end retrial if 0 or } j \textit{ down} \\
\qquad\quad \textbf{else } \ell_{split} := \min(\ell_{split}, \ell' + j) \quad \textit{// else update last-split level} \\
\qquad \textbf{else if } \ell' > \ell_{split} \textbf{ then} \quad \textit{// trial went up far enough:} \\
\qquad\quad \ell_{split} := \ell' \quad \textit{// update last-split level} \\
\qquad\quad \textbf{foreach } i \in \{1, \ldots, f_S(\ell') - 1\} \textbf{ do } \xi.\text{add}(\langle s', t, \ell', \ell_{split} \rangle) \quad \textit{// split off retrials} \\
\textbf{return } t_F \quad \textit{// return accumulated weighted time spent in rare states}
\end{array}
$$

Algorithm 1: Restart with prolonged retrials of depth $j$ (Restart-P<sub>j</sub>)

Importance splitting with Restart starts a run (the *main trial*) from $s_0$ that, whenever it moves up from $s$ in current level $l - 1$ to $s'$ in level $l$, spawns $f_S(l) - 1$ new child runs (*retrials of level* $l$) from $s'$. Retrials end when they move down below their creation level. The trials' weights in probability estimation are appropriately reduced to compensate. Restart with prolonged retrials of depth $j$, denoted Restart-P<sub>j</sub>, is defined as follows in [21] (shortened and adapted to our notation):

In Restart-P<sub>j</sub>, each of the retrials of level $i$ finishes when it leaves set $C_{i-j}$; that is, it continues until it down-crosses the threshold $i-j$. If one of these trials again up-crosses the threshold where it was generated (or any other between $i-j+1$ and $i$), a new set of retrials is not performed. If $j \ge i$, the retrials are cut when they reach the threshold 0. The main trial, which continues after leaving set $C_{i-j}$, potentially leads to new sets of retrials if it up-crosses threshold $T_i$ after having left set $C_{i-j}$. If the main trial reaches the threshold 0, it collects the weight of all the retrials (which has been cut at that threshold) and thus, new sets of retrials of level 1 are performed next time the main trial up-crosses threshold $T_1$.

In addition, [21, Fig. 1] graphically illustrates the behaviour of Restart-P<sub>1</sub>. The original Restart [22] is Restart-P<sub>0</sub>. The above textual description clearly conveys the core idea of Restart-P, but we found it to omit three technical details:

– The condition for when an up-going retrial spawns new retrials is more complex than with Restart. We became aware of this when comparing the textual description with the graphical depiction in [21, Fig. 1]. In fact, we need to keep track of the last level at which a retrial will split, and decrement that value when it moves more than $j$ levels down. (Independent replication.)


[…] must not change when moving down $\le j$ levels. (Resembles a reproduction.)

We make these details explicit in Algorithm 1, for the case of estimating the long-run average time spent in $F$ (i.e. steady-state simulation). FIG evolved as described above and is thus mostly a replication. modes was extended with prolongations later, using a recursive formulation of the algorithm gleaned from the original code. It thus lacks the complete independence of a replication as per [1].
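Algorithm 1 translates almost line by line into executable code. The Python sketch below is one such transcription and is meant as an illustration only, not as the implementation in FIG, modes, or the original C code. The parameter names, the use of a plain list as the trial container, and the single-queue toy model `queue_step` are assumptions made for this example; the toy model is not one of the models of [19,21].

```python
import random


def restart_p(s0, step, in_rare, level, split_factor, j, t_max, rng):
    """Weighted time spent in rare states under Restart-P_j (after Algorithm 1)."""
    t_rare = 0.0
    # Each trial is a tuple: (state, time, creation level, last-split level).
    trials = [(s0, 0.0, 0, 0)]
    while trials:                       # run all trials to their end
        s, t, l_create, l_split = trials.pop()
        while t < t_max:
            dt, s_next = step(s, rng)   # simulate to the next change in state
            dt = min(dt, t_max - t)     # advance time, at most to t_max
            t += dt
            if in_rare(s):              # accumulate weighted rare time
                weight = 1
                for i in range(1, l_split + 1):
                    weight *= split_factor(i)
                t_rare += dt / weight
            l, l_next = level(s), level(s_next)
            s = s_next
            if l_next < l:              # trial went down
                if l_next == 0 and l_create == 0:
                    l_split = 0         # reset main trial at level 0
                elif l_next == 0 or l_next < l_create - j:
                    break               # end retrial: at 0 or more than j down
                else:
                    l_split = min(l_split, l_next + j)
            elif l_next > l_split:      # trial went up far enough
                l_split = l_next
                for _ in range(split_factor(l_next) - 1):
                    trials.append((s, t, l_next, l_split))  # split off retrials
    return t_rare


def queue_step(arrival_rate, service_rate):
    """Hypothetical toy model: one unbounded Markovian queue, state = queue length."""
    def step(n, rng):
        rate = arrival_rate + (service_rate if n > 0 else 0.0)
        dt = rng.expovariate(rate)      # exponential sojourn time
        if n == 0 or rng.random() < arrival_rate / rate:
            return dt, n + 1            # arrival
        return dt, n - 1                # service completion
    return step


if __name__ == "__main__":
    t_max = 2000.0
    t_rare = restart_p(
        s0=0,
        step=queue_step(1.0, 3.0),
        in_rare=lambda n: n >= 6,       # rare event: at least 6 customers
        level=lambda n: min(n, 6),      # level function with k = 6
        split_factor=lambda i: 3,       # same splitting factor on every level
        j=1,
        t_max=t_max,
        rng=random.Random(1),
    )
    print("estimated fraction of time in rare states:", t_rare / t_max)
```

Dividing the returned value by `t_max` gives an estimate of the long-run fraction of time spent in rare states. On a deterministic model all retrials follow the same path, so the weighted sum must coincide with the unweighted time in rare states, which is a convenient sanity check for the weight bookkeeping.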

#### 3 Experiments

Table 2 in [21] provides steady-state estimates, numbers of samples, and runtimes obtained using Restart-P<sub>j</sub> on a Jackson (i.e. Markov) 2-tandem queueing network for $j \in \{0, \ldots, 4\}$. The same data is given in [19] for $j \in \{0, \ldots, 2\}$ on a similar system with three queues and a seven-node network, in Jackson and non-Jackson (using Erlang and hyperexponential distributions) variants. The original articles and extra material [20] describe the models and the experimental setup:


In our replication attempt, we had to resolve the following unspecified aspects:

– The queue capacities $C > L$ are not documented, but influence the estimate: for $C$ close to $L$, the steady-state probability is underestimated. We settled for $C = 20 \cdot L$ in FIG's IOSA models (replication); the influence of $C - L$ rapidly diminishes beyond small values. Later, from inspecting the original source code, we found that the queues are practically unbounded (implemented as 32-bit integer counters), which we reproduce in the Modest models for modes.

Table 1. Experimental results for the examples considered in [19,21]


The original experiments were realised in a single file of C code that represents both the algorithm and the models, specialised to queueing models with transition probabilities and service rates specified in constant arrays. In fact, the code we received implemented the 2-tandem queueing network only. We extended this code with a compile-time choice among the models described in [20], and fixed a few small bugs. We thus have four sets of results to compare, shown in Table 1: the original numbers given in [19,21], plus those from our new executions of the adapted code, modes, and FIG. In the table, time is in seconds, $\hat{p}$ is the estimate, $p$ is the true steady-state probability where it can be derived, and $n$ is the number of samples needed by the statistical evaluation. The adapted code and FIG ran on an Intel Xeon E5-2630 v3 (2.4-3.2 GHz), and modes ran on a Core i7-4790 (3.6-4.0 GHz, 4 physical/8 logical cores) system. The adapted code and FIG are single-threaded whereas modes used 7 simulation threads. The adapted code is tailor-made for these models, while FIG has to encode them in the more general IOSA framework, making it slower; modes in turn profits from a special-case implementation for CTMC to speed up the Markovian cases. Comparing runtimes *between tools* is thus of limited use. The estimates are the centers of confidence intervals returned by the tools with confidence and relative width as described above. Each ⟨estimate, $n$, time⟩ triple was selected from 5 tool executions by picking the one with the median runtime. We underline the best runtimes among values for $j$. However, the wide confidence intervals (except for 2-tandem), few executions, and in principle unsound stopping criterion that we reproduce from the original experiments mean that results, including best values of $j$, vary a lot for different random seeds. The original experimental setup is thus insufficient for drawing conclusions about the precise tradeoffs between specific values of $j$, but may at most expose an overall trend.
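The selection rule for the reported triples can be illustrated with a few lines of Python. The record type, the function name, and the numbers in the usage note are hypothetical and do not come from the experiments; for an even number of executions the sketch picks the upper median, which is an arbitrary choice.

```python
from typing import NamedTuple


class Run(NamedTuple):
    estimate: float   # centre of the confidence interval
    samples: int      # n, number of samples used
    runtime_s: float  # wall-clock time in seconds


def median_runtime_run(runs):
    """Return the whole triple of the execution with the median runtime."""
    if not runs:
        raise ValueError("need at least one execution")
    ordered = sorted(runs, key=lambda r: r.runtime_s)
    return ordered[len(ordered) // 2]
```

With five made-up executions taking 12, 9, 30, 11 and 15 seconds, the triple of the 12-second execution is reported, including its estimate and sample count. Picking a whole triple, rather than taking medians per column, keeps the three reported values consistent with one actual execution.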

Nevertheless, our estimates are mostly within the margin of error around the original or true results. We confirm the main experimental conclusion of [19,21]: as j increases, n decreases, but from some point—mostly j > 1 or 2—runtime increases, due to the overhead of more retrials surviving longer. For the non-Jackson triple tandem network, none of our results matches the numbers of [19]. Since *the original code*, albeit adapted, agrees with FIG and modes rather than with the original results, we suspect an error in [19] or [20] w.r.t. this one model.

#### 4 Conclusion

We demonstrated the extension of the FIG and modes rare event simulation tools to support prolonged retrials in rare event simulation using Restart importance splitting. These implementations and experiments were the outcome of an exercise in independently *replicating* experimental research originally performed in mathematics, from a computer science perspective. We confirm the key findings of the earlier work. At the same time, we document several issues—small but critical technical details of the algorithm and experimental setup—where the publicly available information was insufficient for a completely independent replication. We in particular noticed that replicating randomised/statistical algorithms poses a particular challenge since small errors may result in subtle mis-estimations that are often drowned in the overall statistical error. In the end, however, all issues could be resolved due to the exceptional support, responsiveness, and openness of the original author, José Villén-Altamirano, whom we thank earnestly. However, such support cannot be expected for experimental work in general, in particular where temporary staff like Ph.D. students—who eventually graduate and move to new institutions or industry—perform the bulk of the experiments. This paper thus also highlights the need for computer science and the formal verification community to continue their push for artifact evaluation and archived, publicly available *reproduction* packages. A reproduction package for our experiments is archived at DOI 10.6084/m9.figshare.12269462.

#### References


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original authors and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

#### **A Web Interface for Petri Nets with Transits and Petri Games**<sup>⋆</sup>

Manuel Gieseking<sup>1</sup>, Jesko Hecking-Harbusch<sup>2</sup>, and Ann Yanich<sup>1</sup>

<sup>1</sup> University of Oldenburg, Oldenburg, Germany {gieseking,ann.yanich}@informatik.uni-oldenburg.de <sup>2</sup> CISPA Helmholtz Center for Information Security, Saarbrücken, Germany jesko.hecking-harbusch@cispa.de

**Abstract.** Developing algorithms for distributed systems is an error-prone task. Formal models like Petri nets with transits and Petri games can prevent errors when developing such algorithms. Petri nets with transits allow us to follow the data flow between components in a distributed system. They can be model checked against specifications in LTL on both the local data flow and the global behavior. Petri games allow the synthesis of local controllers for distributed systems from safety specifications. Modeling problems in these formalisms requires defining extended Petri nets, which can be cumbersome when performed textually.

In this paper, we present a web interface<sup>1</sup> that allows an intuitive, visual definition of Petri nets with transits and Petri games. The corresponding model checking and synthesis problems are solved directly on a server. In the interface, implementations, counterexamples, and all intermediate steps can be analyzed and simulated. Stepwise simulations and interactive state space generation support the user in detecting modeling errors.

#### **1 Introduction**

Distributed systems consist of several individual components. Each component has incomplete information about the other components. Asynchronous distributed systems have no fixed rate at which components progress but rather each component progresses at its individual rate between synchronizations with other components. Implementing correct algorithms for asynchronous distributed systems is difficult because they have to both work with the incomplete information of the components and for every possible scheduling between the components.

Petri nets [22,21] are a natural model for asynchronous distributed systems. Tokens represent components and transitions with more than one token correspond to synchronizations between the components. Petri nets with transits [9] extend Petri nets with a transit relation to model the data flow in asynchronous distributed systems. Flow-LTL [9] is a specification language for Petri nets with transits and allows us to specify linear properties on both the global and the local view of the system. In particular, it is possible to globally select desired runs of the system with LTL (e.g., only fair and maximal runs) and check the local data flow of only those runs again with LTL. A model checker for Petri nets with transits against Flow-LTL is implemented in the tool AdamMC [10].

<sup>⋆</sup> This work has been supported by the German Research Foundation (DFG) through Grant Petri Games (392735815) and through the Collaborative Research Center "Foundations of Perspicuous Software Systems" (TRR 248, 389792660), and by the European Research Council (ERC) through Grant OSARES (683300).

<sup>1</sup> The web interface is deployed at http://adam.informatik.uni-oldenburg.de.

© The Author(s) 2021

J. F. Groote and K. G. Larsen (Eds.): TACAS 2021, LNCS 12652, pp. 381–388, 2021. https://doi.org/10.1007/978-3-030-72013-1\_22

Petri games [14] define the synthesis of asynchronous distributed systems based on Petri nets and causal memory. With causal memory, players exchange their entire causal past only upon synchronization. Without synchronization, players have no information about each other. For safety winning conditions, the synthesis algorithm for Petri games with a bounded number of controllable components and one uncontrollable component is implemented in AdamSYNT [12]<sup>2</sup>. Both tools are command-line tools lacking visual support to model Petri nets with transits or Petri games and the possibility to simulate or interactively explore implementations, counterexamples, and parts of the created state space.

In this paper, we present a web interface<sup>3</sup> for model checking asynchronous distributed systems with data flows and for the synthesis of asynchronous distributed systems with causal memory from safety specifications. The web interface offers an input for Petri nets with transits and Petri games where the user interactively creates places, transitions, and their connections with a few inputs.

As a back-end, the algorithms of AdamMC are used to model check Petri nets with transits against a given Flow-LTL formula as specification. Internally, the problem is reduced to the model checking problem of Petri nets against LTL. Both the input Petri net with transits and the constructed Petri net can be visualized and simulated in the web interface. For a positive result, the web interface lets the user follow the control flow of the combined system and the data flow of the components. For a negative result, the web interface simulates the counterexample with a visual separation of the global and each local behavior.

The algorithms of AdamSYNT solve the given Petri game with safety specification. Internally, the problem is reduced to solving a finite two-player game with complete information. For a positive result, a winning strategy for the Petri game and the two-player game can be visualized and the former can be simulated. For a negative result, the web interface lets the user interactively construct strategies of the two-player game and highlights why they violate the specification. These new intuitive construction methods, interactive features, and visualizations are of great impact when developing asynchronous distributed systems.

#### **2 Web Interface for Petri Nets with Transits**

The web interface can model check Petri nets with transits against Flow-LTL. We use an example from software-defined networks to showcase the workflow.

<sup>2</sup> AdamSYNT was previously called Adam. From now on, AdamMC and AdamSYNT are combined in the tool Adam (https://github.com/adamtool/adam).

<sup>3</sup> The web interface is open source (https://github.com/adamtool/webinterface) and a corresponding artifact to set it all up locally in a virtual machine is available [16].

**Fig. 1.** Screenshot from the web interface for the model checking workflow.

**Workflow for Petri Nets with Transits** One application domain for Petri nets with transits is software-defined networks (SDNs) [20,4]. The nodes of the network are switches which forward packets along the edges of the network according to the routing configuration. Packets enter the network at ingress switches and leave it at egress switches. SDNs separate the packet forwarding process, called the data plane, from the routing process, called the control plane. Concurrent updates to the routing configuration are difficult to get right [15].

The separation of data and control plane and updates to the routing configuration can be encoded into Petri nets with transits [9]. Using this encoding, we demonstrate the workflow of the web interface for model checking an asynchronous distributed system with data flows. The packets of the SDN are modeled by the data flow in the Petri net with transits. The data flow relation as an extension from Petri nets to Petri nets with transits is depicted as colored and labeled arcs. In Fig. 1, the web interface presents the resulting Petri net with transits N. First, we use the tools on the left to create for each switch a place si with i ∈ {0,..., 5} and add a token (cf. outer parts of N). Then, we create transitions for the connections between the switches and for the origin of packets in the SDN (cf. transition ingress in the top-left corner) and link them with flows in both directions. Additionally, we create local transits between the switches corresponding to the forwarding of packets. They are displayed in light blue and red and are identified by the letters. This constitutes the data plane.

Next, we define the control plane, i.e., which forwarding is activated. Each transition to forward packets is connected to a place ai with i ∈ {0,..., 5} which has a token when the forwarding is configured initially (cf. places a3, a4, and a5) and no token otherwise (cf. places a0, a1, and a2). For the concurrent update, we create places ui with i ∈ {0,..., 7} and transitions ti with i ∈ {6,..., 11} with corresponding flows (cf. inner parts of N).
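The enabling condition behind this encoding can be sketched in a few lines of Python. This is only an illustration of the standard Petri net firing rule, not code from the web interface or AdamMC. The fragment (a single forwarding transition `fwd` between `s0` and `s1`, guarded by `a0`) and all identifiers are assumptions made for the example, and transits are omitted.

```python
from typing import Dict, FrozenSet, NamedTuple


class Transition(NamedTuple):
    name: str
    pre: FrozenSet[str]   # places a token is taken from
    post: FrozenSet[str]  # places a token is put on


Marking = Dict[str, int]


def enabled(marking: Marking, t: Transition) -> bool:
    """A transition is enabled if every place in its preset holds a token."""
    return all(marking.get(p, 0) > 0 for p in t.pre)


def fire(marking: Marking, t: Transition) -> Marking:
    """Return the successor marking; raises if t is not enabled."""
    if not enabled(marking, t):
        raise ValueError(f"{t.name} is not enabled")
    successor = dict(marking)
    for p in t.pre:
        successor[p] -= 1
    for p in t.post:
        successor[p] = successor.get(p, 0) + 1
    return successor


# Hypothetical fragment: forwarding from s0 to s1 is guarded by a0.
# Flows in both directions keep the switch and activation tokens in place.
places = frozenset({"s0", "s1", "a0"})
fwd = Transition("fwd", pre=places, post=places)

inactive = {"s0": 1, "s1": 1, "a0": 0}  # forwarding not configured
active = {"s0": 1, "s1": 1, "a0": 1}    # forwarding configured

print(enabled(inactive, fwd))  # False
print(enabled(active, fwd))    # True
```

In this sketch, an update transition that moves a token onto `a0` would switch the forwarding on, which corresponds to the role the places ai play in the control plane.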

Transitions for the forwarding are set as weak fair, i.e., whenever a transition is infinitely long enabled in a run, it also has to fire infinitely often, indicated by the purple color of the outer transitions. Transitions for the update do not require fairness assumptions. A satisfied Flow-LTL formula is AF s5 specifying that all packets eventually reach switch s5. An unsatisfied formula is (G u0 ⇒ AF s2) requiring for runs, where the update is never executed, that all packets are taking the lower-left route. The fairness assumptions and a maximality assumption, i.e., whenever some transition can fire in a run some transition fires, are automatically added to the formula. In the screenshot, a counterexample for the unsatisfied formula is displayed on the right. The first packet takes the upper-right route via transitions t3, t4, and t5 and the update never starts.

**Features for Petri Nets with Transits.** AdamMC [10] is a command-line model checking tool for Petri nets with transits and Flow-LTL [9]. The model checking problem of Petri nets with transits against Flow-LTL is solved by a reduction to Petri nets and LTL. The web interface allows displaying and arranging the nodes of the Petri net from the reduction and the input Petri net with transits. Automatic layout techniques are applied to avoid the overlapping of nodes. A physics control, which modifies the repulsion, link, and gravity strength of nodes, can be used to minimize the overlapping of edges. Heuristics generate coordinates for the constructed Petri net by using the coordinates of the input Petri net with transits to obtain a similar layout of corresponding parts.

For a positive result, the web interface allows visualizing the data flow trees for given firing sequences of the nets. For a negative result, the counterexample can be simulated both in the Petri net with transits and in the Petri net from the reduction. The witness of the counterexample for each flow subformula and the run violating the global behavior can be displayed by the web interface. This functionality is helpful when developing an encoding of a problem into a Petri net with transits to ensure that a counterexample is not an error in the encoding. The constructed Petri net can be exported into a standard format for Petri net model checking (PNML) and the constructed LTL formula can be displayed.

#### **3 Web Interface for Petri Games**

The web interface can synthesize local controllers from safety specifications. The workflow is showcased for a distributed alarm system given as a Petri game.

**Workflow for Petri Games** We demonstrate the workflow of the web interface for the synthesis of asynchronous distributed systems with causal memory from safety specifications. Petri games separate the places of an underlying Petri net into system places and environment places. Tokens on system places are system players and tokens on environment places are environment players. Each player has causal memory: only upon synchronization with other players do they exchange their entire causal past. For safety specifications, the system players have to avoid that a bad place is reached for all behaviors of the environment players.

**Fig. 2.** Screenshot from the web interface for the synthesis workflow.

We want to obtain two local controllers of a distributed alarm system that should indicate the location of a burglary at both controllers. In Fig. 2, the web interface presents the resulting Petri game on the left and the winning strategy for the alarm system on the right. The burglar is modeled by an environment player and each component of the distributed alarm system by a system player. Environment players are on white places and system players on gray ones. We create five environment places e0, e1, e2, eL, and eR. The place e0 has a token, e1 and e2 serve for the decision to burgle a location, and eL and eR for actually burgling the location. Each component x ∈ {p, q} of the alarm system has one system place x0 with a token, two system places x1 and x2 to detect a burglary and inform the other component, and two system places xL and xR to sound an alarm with the position of a burglary. We create rows of transitions for the environment player deciding where to burgle (first row), for the components detecting a burglary (second row), for the communication between the components (third row), and for sounding the alarm at each location (fourth row).

At last, we use transitions fai with i ∈ {0,..., 3} and frj with j ∈ {0,..., 7} connected to the bad place bad to define that the implementation of the distributed alarm system should avoid false alarms and false reports. A false alarm occurs if the burglar did not burgle any location but an alarm occurred, i.e., in every pair of places {e0}×{pL, pR, qL, qR}. A false report occurs if a burglary happened at a location but a component of the alarm system indicates a burglary at the other location, i.e., in every pair of places {e1, eL}×{pR, qR} and {e2, eR}×{pL, qL}. We add transitions and flows to bad for these cases.
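The number of transitions to the bad place follows directly from these products of places. The following Python sketch enumerates the pairs. The place names are taken from the text, whereas the particular assignment of indices to pairs is an assumption made for illustration.

```python
from itertools import product

# False alarm: no burglary yet (token on e0) but some alarm place is marked.
false_alarm = list(product(["e0"], ["pL", "pR", "qL", "qR"]))

# False report: burglary at one location, alarm indicating the other location.
false_report = (
    list(product(["e1", "eL"], ["pR", "qR"]))
    + list(product(["e2", "eR"], ["pL", "qL"]))
)

# One transition to the bad place per pair (index order is hypothetical).
bad_transitions = {f"fa{i}": pair for i, pair in enumerate(false_alarm)}
bad_transitions.update({f"fr{j}": pair for j, pair in enumerate(false_report)})

print(len(false_alarm), len(false_report))  # 4 8
```

The counts 4 and 8 match the index ranges i ∈ {0,..., 3} and j ∈ {0,..., 7} given above.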

The web interface finds a winning strategy (depicted on the right in Fig. 2) for the Petri game described above. Each component locally monitors its location (t2, t3) and simultaneously waits for information about a burglary at the other location (t4, t5). When a burglary is detected at the location of the component then it first informs the other component (t4, t5) and then outputs an alarm for the current location (t7, t8). When a component is informed about a burglary at the other location, it outputs an alarm for the other location (t6, t9).

**Features for Petri Games** AdamSYNT [12] is a command-line tool for Petri games [14]. The synthesis problem for Petri games with a bounded number of system players, one environment player, and a safety objective is reduced to the synthesis problem for two-player games. A winning strategy in the two-player game is translated into a winning strategy for the Petri game. Both can be visualized in the web interface. Here, the web interface provides the same features for visualizing, manipulating, and automatically laying out the elements as for model checking. It uses the order of nodes of the Petri game to heuristically provide a positioning of the strategy and allows simulating runs of the strategy. The winning strategy of the two-player game provides an additional view on the implementation to check if it is not bogus due to a forgotten case in the Petri game or specification. For an unrealizable synthesis problem, the web interface allows analyzing the underlying two-player game via a stepwise creation of strategies. This guides the user towards changes to make the problem realizable.

#### **4 Implementation Details**

The server is implemented using the Sparkjava micro-framework [23] for incoming HTTP and WebSocket connections. The client is a single-page application written in JavaScript using Vue.js [25], D3 [5], and the Vuetify component library [26]. We constructed libraries out of the tools AdamMC and AdamSYNT and implemented one interface handling both libraries. Common features like the physics control of nodes share the same implementation. All components of the libraries and the web interface [2] are open source and available on GitHub [1].

#### **5 Conclusion**

We presented a web interface for two tools: AdamMC, a model checker for data flows in asynchronous distributed systems represented by Petri nets with transits, and AdamSYNT, a synthesis tool for local controllers from safety specifications in asynchronous distributed systems with causal memory represented by Petri games. The web interface makes the modeling and debugging of Petri nets with transits and Petri games user-friendly as it presents visual representations of the input, all intermediate steps, and the output of the tools. The interactive features are a great assistance for correctly modeling distributed systems.

We plan to extend the web interface and tool support to model checking Petri nets with transits against Flow-CTL<sup>∗</sup> [11], to other classes of Petri games with a decidable synthesis problem [13,3], to the bounded synthesis approach for Petri games [7,8,19,18], and to high-level Petri games [17]. As our web interface is open source and easy to extend, we also plan to connect it to other tools for Petri nets like APT [24], LoLA [27], or TAPAAL [6].

#### **References**



#### **Momba: JANI Meets Python**<sup>⋆</sup>

Maximilian A. Köhl<sup>1</sup>, Michaela Klauck<sup>1</sup>, and Holger Hermanns<sup>1,2</sup>

<sup>1</sup>Saarland University, Saarland Informatics Campus, Saarbrücken, Germany <sup>2</sup>Institute of Intelligent Software, Guangzhou, China {koehl,klauck,hermanns}@cs.uni-saarland.de

**Abstract.** JANI-model [6] is a model interchange format for networks of interacting automata. It is well-entrenched in the quantitative model checking community and allows modeling a variety of systems involving concurrency, probabilistic and real-time aspects, as well as continuous dynamics. Python is a general purpose programming language preferred by many for its ease of use and vast ecosystem. In this paper, we present *Momba*, a flexible Python framework for dealing with formal models centered around the JANI-model format and formalism. Momba strives to deliver an integrated and intuitive experience for experimenting with formal models making them accessible to a broader audience. To this end, it provides a pythonic interface for model construction, validation, and analysis. Here, we demonstrate these capabilities.

#### **1 Introduction**

Dealing with formal models encompasses a variety of tasks that can be challenging, especially for newcomers. Everything starts with the *construction* of a model or a family thereof. Often a textual or other, more formal, description of the scenario to be modeled already exists, such as a rough sketch of the desired behavior or a circuit diagram. Then, after a formal model has finally been conceived, one has to *validate* that the model adequately captures what should be modeled. In this regard, models are like any other human artifact: inadequate initially, but improving over time. Only after confidence in the model has been established can one harvest the benefits by handing the model over to *analysis* tools, e.g., a model checker.

In this paper, we present *Momba*, a flexible Python framework for dealing with formal models. Momba strives to deliver an integrated and intuitive experience to aid the process of model construction, validation, and analysis. It provides convenience functions for the construction of models, effectively turning Python into a syntax-aware macro language that enables the construction of models in a modular fashion. Momba's built-in simulation engine allows gaining confidence in a model, for instance, by rapidly prototyping a tool for interactive model exploration and visualization, or by connecting it to a testing framework. Finally, thanks to the JANI-model [6] interchange format, several state-of-the-art model checkers and other tools are readily available for analysis. The latest version of Momba is always available on GitHub [1] and the evaluated artifact of this tool demo paper can be found on Zenodo [27].

<sup>⋆</sup> This work was partially supported by the ERC Advanced Investigators Grant 695614 (POWVER), by the German Research Foundation (DFG) under grant No. 389792660, as part of TRR 248, see https://perspicuous-computing.science, and by the Key-Area Research and Development Program Grant 2018B010107004 of Guangdong Province.

© The Author(s) 2021

J. F. Groote and K. G. Larsen (Eds.): TACAS 2021, LNCS 12652, pp. 389–398, 2021. https://doi.org/10.1007/978-3-030-72013-1\_23

*Why Momba?* The idea to harvest a general purpose programming environment for formal modelling is not new at all. For instance, the SVL language combines the power of process algebraic modelling with the power of the Bourne shell. As part of many CADP installations [12,13], it has been in daily use since its inception [11]. Many formal modeling tools also already provide Python bindings [23,10]. Momba tries not to be yet another incarnation of these ideas.

While the construction of formal models clearly is an integral part of Momba, Momba is more than just a framework for constructing models with the help of Python. Most importantly, it also provides features to work with these models such as a simulator or an interface to different model checking tools. At the same time, it is not just a binding to an API developed for another language, say C++. Momba is *tool-agnostic* and aims to provide a pythonic interface for dealing with formal models while leveraging existing tools. Momba covers the whole process from model creation through validation to analysis. To this end, it is centered around the well-entrenched JANI-model interchange format.

*Why JANI?* Traditionally, most analysis tools for formal models came with their own modeling languages and formats. The resulting fragmentation hindered interoperability between and comparability across different tools. JANI-model [6] has been conceived with the vision to put an end to this fragmentation. It has since been adopted by many quantitative model checkers [20,21,9] while for others translators have been developed [20,9] enabling cross-tool comparability and fostering competition within the community [22,19,7]. Recently, JANI has also been discovered by the planning community [24,25].

Momba supports all features of the JANI-model specification and some of its optional extensions. JANI is the natural foundation for a project like Momba. It provides a solid, well-established, and powerful modeling formalism for a variety of different kinds of systems involving concurrency, probabilistic and real-time aspects, as well as continuous dynamics. A JANI model is a network of interacting automata with variables. Attached to a model one can also specify various kinds of probabilistic and timed properties which can then be checked by several model checkers, e.g., ePMC [20], The Modest Toolset [21], and Storm [23]. The broad tool support for JANI models enables us to build upon existing research and to outsource computation-intensive tasks via unified interfaces.

*Why Python?* Python is a popular high-level programming language, preferred by many for its ease of use and ecosystem. Especially within the data-science community, Python is the go-to language for data analysis and machine learning, leveraging tools such as TensorFlow [2] and scikit-learn [29]. Around these tools, scientific general purpose tools such as Jupyter [26] have emerged. Jupyter provides a platform for documenting scientific experiments and their results in a reproducible way combining code, data, and documentation.

Our vision is to harvest Python's ecosystem and the tools developed by the scientific community for dealing with formal models. Imagine a Jupyter notebook documenting a model, including the code to construct it, with interactive visualizations of the model itself and various analysis results.

By basing our efforts on a popular language that is appreciated by scientists and established in the scientific community, we hope to lower the entry barrier, especially for those outside the formal methods community.

*The User Perspective.* In what follows, we demonstrate multiple facets of Momba using a variant of Racetrack, a well-known benchmark in autonomous AI decision making [4,31] which has recently found its use in several model checking contexts [16,3,15], too. We go through the entire process from the construction of a family of models through their validation to their analysis. For each step, we highlight what Momba has to offer in terms of effectively supporting the process.

Racetrack was originally a pen-and-paper game [14]. A *track* is a two-dimensional grid comprising *start*, *goal*, *wall*, and *blank* cells (cf. Fig. 1) [4]. A vehicle starts off with some initial velocity from a start cell, with the objective to reach a goal cell as fast as possible without crashing into a wall. The vehicle is controlled by nine possible actions modifying the current velocity vector. Racetrack naturally lends itself as a benchmark for sequential decision making in risky scenarios, in particular, when extended with probabilistic noise. In a variety of such noisy forms, it has been adopted as a benchmark for *Markov Decision Process* (MDP) algorithms in the AI community [4,5,28,30,31].

For our demonstration, we consider multiple *variants* of Racetrack giving rise to a family of MDPs, studied recently [3] from a feature-oriented perspective [8]. For example, there are different tank options and fuel is consumed according to various consumption models. In addition, there are different undergrounds inducing probabilistic noise modeling slippery road conditions. Clearly, this modeling scenario is beyond what is possible with mere model parametrization, especially so because we are interested in the car's performance on different tracks each inducing its own MDP [4].

#### **2 Scenario-Based Model Construction**

Usually, formal models are not constructed out of thin air but based on some kind of scenario description that exists upfront. Such descriptions usually comprise an operational characterization of the behavior to model together with additional and sometimes more formal information about the specific case. Our use case is no exception: here, a textual description of the behavior of the car is provided together with a specific track and a specification of the variant.

Naturally, Python can be used to nicely capture the formal parts of a scenario description in various data structures. Combined with a domain-specific parser for configuration files, scenario descriptions are interchangeable and easy to interface with the code for model construction. In our case, a textual representation of the track (cf. Fig. 1) [4] is provided and parsed together with additional parameters, like the size of the tank and the type of the underground, into a data structure tailored to that purpose.

**Fig. 1.** Textual representation (left) and picture of a track (right): start cells in blue (s), goal cells in green (g), and wall cells marked with x.
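As a rough illustration of such a tailored data structure, the following Python sketch parses a textual track into sets of cells. It is not the parser from the artifact. The class `Track`, the function `parse_track`, the use of `.` for blank cells, and the tiny example track are all assumptions made for this example. Only the symbols `s`, `g`, and `x` are taken from Fig. 1.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Cell = Tuple[int, int]  # (x, y), with y counting rows from the top


@dataclass(frozen=True)
class Track:
    width: int
    height: int
    start_cells: FrozenSet[Cell]
    goal_cells: FrozenSet[Cell]
    wall_cells: FrozenSet[Cell]


def parse_track(text: str) -> Track:
    """Parse a rectangular grid of 's', 'g', 'x', and '.' (assumed blank)."""
    rows = [line.strip() for line in text.strip().splitlines()]
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows of a track must have the same length")
    cells = {"s": set(), "g": set(), "x": set()}
    for y, row in enumerate(rows):
        for x, symbol in enumerate(row):
            if symbol in cells:
                cells[symbol].add((x, y))
            elif symbol != ".":
                raise ValueError(f"unknown cell symbol {symbol!r}")
    return Track(
        width=width,
        height=len(rows),
        start_cells=frozenset(cells["s"]),
        goal_cells=frozenset(cells["g"]),
        wall_cells=frozenset(cells["x"]),
    )


# A made-up miniature track, not the one shown in Fig. 1.
track = parse_track("""
    xxxxx
    s...g
    s.x.g
    xxxxx
""")
print(track.width, track.height, sorted(track.start_cells))
```

Additional parameters such as the tank size or the underground could be stored alongside the track in a similar frozen dataclass.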

Now, how does Momba support the construction of models from such data structures? A distinguishing feature of Momba is that it effectively turns Python into a syntax-aware macro language enabling the modular construction of models. For our Racetrack use case different fuel consumption models can be captured as macros from JANI expressions to JANI expressions:

```
linear = lambda dx, dy: expr("abs($dx) + abs($dy)", dx=dx, dy=dy)
quadratic = lambda dx, dy: expr("$linear ** 2", linear=linear(dx, dy))
```
A macro is simply a Python function. Upon execution, these macros construct JANI expressions using a straightforward syntax inspired by Python expressions. In this case, both functions take expressions for the current velocity of the vehicle in the x and y dimensions and return an expression for the resulting fuel consumption which is either *linear* or *quadratic* in the velocity. In contrast to how macros work in languages like C, syntax-aware macros using Momba's expr function prevent surprises from mere text-based expansion. Being Python functions, macros can be easily passed around and used elsewhere:
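To see what kind of surprise text-based expansion can cause, consider the following self-contained sketch. It does not use Momba. Instead it mimics the idea with Python's own `ast` module (requires Python 3.9 or later for `ast.unparse`). The helper names `text_expand`, `tree_expand`, and `var`, as well as the `_hole_` placeholder prefix, are hypothetical.

```python
import ast
import re


def text_expand(template: str, **parts: str) -> str:
    """C-style expansion: paste the argument text into the template."""
    return re.sub(r"\$(\w+)", lambda m: parts[m.group(1)], template)


class _Splice(ast.NodeTransformer):
    """Replace placeholder names by the expression trees bound to them."""

    def __init__(self, parts):
        self.parts = parts

    def visit_Name(self, node):
        if node.id.startswith("_hole_"):
            return self.parts[node.id[len("_hole_"):]]
        return node


def tree_expand(template: str, **parts: ast.expr) -> ast.expr:
    """Syntax-aware expansion: parse first, then splice in subtrees."""
    source = re.sub(r"\$(\w+)", r"_hole_\1", template)
    tree = ast.parse(source, mode="eval").body
    return _Splice(parts).visit(tree)


def var(name: str) -> ast.expr:
    return ast.Name(id=name, ctx=ast.Load())


def evaluate(source: str, **env: int) -> int:
    return eval(source, {"abs": abs}, env)


# Text-based: the pasted sum is torn apart by operator precedence.
linear_text = text_expand("abs($dx) + abs($dy)", dx="dx", dy="dy")
quadratic_text = text_expand("$linear ** 2", linear=linear_text)

# Syntax-aware: the sum stays a single subtree below the power operator.
linear_tree = tree_expand("abs($dx) + abs($dy)", dx=var("dx"), dy=var("dy"))
quadratic_tree = tree_expand("$linear ** 2", linear=linear_tree)

print(evaluate(quadratic_text, dx=1, dy=-2))               # 5
print(evaluate(ast.unparse(quadratic_tree), dx=1, dy=-2))  # 9
```

The text-based variant computes 1 + 2² = 5 because only `abs(dy)` is squared, whereas the tree-based variant computes (1 + 2)² = 9 as intended.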

```
assignments = {
    "fuel": expr(
        "min(TANK_SIZE, max(0, fuel - floor($consumption)))",
        consumption=fuel_model(car_dx, car_dy),
    )
}
```
Here, we update the fuel level by taking whatever macro has been provided for computing the fuel consumption. This code is part of constructing an edge for the tank automaton in a modular fashion in the sense that the consumption model is exchangeable. Momba provides further functions, for instance, for declaring variables, like fuel, and constructing automata, networks, as well as other model objects. Most of these functions provide all kinds of comforts, for instance, directly checking the types of the involved expressions.

Using syntax-aware macros and Momba's other convenience functions, we arrive at a Python script racetrack.py [27] generating a collection of JANI models from scenario descriptions comprising a track and specifying a variant. Iterating over possible scenario descriptions, hundreds of JANI models can be generated fully automatically and consequently be analyzed.
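The iteration over scenario descriptions can be pictured as a Cartesian product over the individual options. The sketch below is not taken from racetrack.py. The track names, tank sizes, and the underground `tarmac` are placeholder values chosen for illustration, and a real script would pass each scenario to its model-building code.

```python
from itertools import product

# Placeholder option lists; only "linear", "quadratic", and "sand" occur in the text.
tracks = ["track_a", "track_b"]
tank_sizes = [10, 20]
fuel_models = ["linear", "quadratic"]
undergrounds = ["tarmac", "sand"]


def scenarios():
    """Yield one scenario description per combination of options."""
    for track, tank, fuel, ground in product(
        tracks, tank_sizes, fuel_models, undergrounds
    ):
        yield {
            "track": track,
            "tank_size": tank,
            "fuel_model": fuel,
            "underground": ground,
        }


family = list(scenarios())
print(len(family))  # 16
```

Even with two choices per option the family already has 16 members, so a handful of tracks and variants quickly yields hundreds of models.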

#### **3 Validation by Simulation**

Having our models ready, we have to somehow gain confidence that they actually model what we want them to, before handing them over to analysis tools. One way of gaining confidence in a model is by simulating its behavior and manually checking it for consistency with one's own understanding of what the model should do. Just like any kind of debugging, this can be a tedious and frustrating process, especially with text-based traces generated by some generic simulator. Momba instead comprises a built-in simulation engine, enabling rapid development of interactive visualizations. This effectively allows us to steer a vehicle through a track thereby exploring a model's behavior, testing edge cases as in a racing game, and ultimately gaining confidence in the model.

Momba's built-in simulation engine supports the simulation of a variety of different JANI models including timed models. It has been written completely from scratch with easy accessibility from Python in mind. Non-determinism can be resolved by uniform random sampling or by querying an external oracle such as, in the case of our interactive visualization, the user, a testing framework, or even a neural network as done for DSMC [16]. For each step, the simulator provides all the necessary information like the binding of variables to values, the locations the various automata of a network are in, and the possible actions (and time delays for timed models) that can be taken. This information can then be extracted and used to display whatever is of interest for understanding and investigating the behavior of the model under scrutiny.
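The separation between a simulator and the oracle that resolves non-determinism can be sketched as follows. This is not Momba's simulation API. The state representation, the toy `step` function (no walls, no noise, no fuel), and the names `uniform_oracle` and `scripted_oracle` are assumptions for illustration only.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

State = Dict[str, int]
Action = Tuple[int, int]  # acceleration (ax, ay)
Oracle = Callable[[State, Sequence[Action]], Action]

# The nine actions: each velocity component changes by -1, 0, or +1.
ACTIONS: List[Action] = [(ax, ay) for ax in (-1, 0, 1) for ay in (-1, 0, 1)]


def step(state: State, action: Action) -> State:
    """Toy successor function: accelerate, then move."""
    dx, dy = state["dx"] + action[0], state["dy"] + action[1]
    return {"x": state["x"] + dx, "y": state["y"] + dy, "dx": dx, "dy": dy}


def uniform_oracle(rng: random.Random) -> Oracle:
    """Resolve non-determinism by uniform random sampling."""
    return lambda state, actions: rng.choice(list(actions))


def scripted_oracle(script: Sequence[Action]) -> Oracle:
    """Resolve non-determinism externally, e.g. from user input or a test."""
    remaining = iter(script)
    return lambda state, actions: next(remaining)


def simulate(initial: State, oracle: Oracle, steps: int) -> List[State]:
    trace = [initial]
    for _ in range(steps):
        action = oracle(trace[-1], ACTIONS)
        if action not in ACTIONS:
            raise ValueError(f"oracle chose an unavailable action: {action}")
        trace.append(step(trace[-1], action))
    return trace


start = {"x": 0, "y": 0, "dx": 0, "dy": 0}
print(simulate(start, scripted_oracle([(1, 0), (1, 0), (0, 1)]), 3)[-1])
print(len(simulate(start, uniform_oracle(random.Random(0)), 5)))
```

Because the oracle is just a function of the current state and the available actions, an interactive prompt, a test harness, or a learned policy can be plugged in without touching the simulation loop.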

Fig. 2 shows a simple interactive visualization of the Racetrack example based on Momba's simulation engine where the user can steer the vehicle (indicated by the yellow asterisk) through the track by entering acceleration values. Certainly, there is ample room for beautification of this simulator (see TraceVis [15] for example) but for rapid model development this is not needed. After playing around with the interactive simulation for a while and testing various edge cases, we are confident that the model is adequate.

**Fig. 2.** Interactive visualization using Momba's simulation engine.

#### **4 Harvesting the Benefits**

Having constructed the models and gained confidence in their adequacy, we are now ready to harvest the benefits of formal modeling and to apply various state-of-the-art analysis tools, exploiting the JANI-model interchange. Again, Momba provides the necessary functions to define properties and hand our models, with the respective properties attached to them, over to common analysis tools.

Imagine that we are interested in the property $\mathrm{P}_{\max}(\lozenge\,(\mathit{on\_goal} \land \mathit{fuel} > 0))$, i.e., the maximal probability of reaching a goal cell with a non-empty tank from a given start cell. Using Momba's syntax-aware macros, we first construct a disjunction over all goal cells and then define the property using the concise syntax provided by Momba's prop function:

```
on_goal = reduce(lor, (expr("car_pos == $g", g=g) for g in goal_cells), False)
define_property(
    prop("min({ Pmax(F($on_goal and fuel > 0)) | initial })", on_goal=on_goal),
    name="goalProbabilityFuel",
)
```

After generating a model with the vehicle starting from position (0, 7) on the track depicted in Fig. 1 and with sand as underground, the value iteration engine mcsta [18] of The Modest Toolset calculates a probability of 87.5 % taking 153 s when invoked by Momba with the model. Momba also cross-checks the results for us, by invoking Storm's dd engine [9] (the fastest engine for this model) and obtains the same result in 107 s. These experiments have been carried out on a standard laptop with an Intel Core i7 at 2.7 GHz.

### **5 Conclusion**

We presented Momba, a Python framework for dealing with quantitative models covering the whole process of model creation, validation, and analysis providing an integrated and intuitive experience. In a user story on Racetrack, we demonstrated how Momba's capabilities can be used throughout all stages of the development process of cyber-physical models.

We demonstrated how Momba enables scenario-based model construction with Python code in a concise and modular way with syntax-aware macros. Using Momba's simulation engine, we were able to rapidly prototype an interactive visualization thereby gaining confidence in our models and, finally, thanks to JANI-model, we demonstrated how to analyse our models with state-of-the-art model checkers directly invoked and cross-checked by Momba.

By basing Momba on Python, we aim to harvest the tools developed by the data-science community. Especially, when combined with Jupyter [26], Momba enables literate programming [32] combining code, data, and documentation for reproducible experiments and process documentation.

We hope that Momba helps to open up the world of formal modeling towards a broader community by lowering or removing barriers otherwise obstructing the application of formal models. Momba's infrastructure is implemented in such a way that it can easily be extended in other directions and connected to other research areas, e.g., model checking policies that were machine-learned with Python libraries [16,17].

#### **References**


Proceedings. Lecture Notes in Computer Science, vol. 8413, pp. 593–598. Springer (2014). https://doi.org/10.1007/978-3-642-54862-8\_51


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

**SV-Comp Tool Competition Papers**

### Software Verification: 10th Comparative Evaluation (SV-COMP 2021)

Dirk Beyer

LMU Munich, Munich, Germany

Abstract. SV-COMP 2021 is the 10th edition of the Competition on Software Verification (SV-COMP), which is an annual comparative evaluation of fully automatic software verifiers for C and Java programs. The competition provides a snapshot of the current state of the art in the area, and has a strong focus on reproducibility of its results. The competition was based on 15 201 verification tasks for C programs and 473 verification tasks for Java programs. Each verification task consisted of a program and a property (reachability, memory safety, overflows, termination). SV-COMP 2021 had 30 participating verification systems from 27 teams from 11 countries.

Keywords: Formal Verification · Program Analysis · Competition · Software Verification · Verification Tasks · Benchmark · C Language · Java Language · SV-Benchmarks

### 1 Introduction

Among several other objectives, the Competition on Software Verification (SV-COMP, https://sv-comp.sosy-lab.org/2021) showcases the state of the art in the area of automatic software verification. This edition of SV-COMP is already the 10th edition of the competition and presents again an overview of the currently achieved results by tool implementations that are based on the most recent ideas, concepts, and algorithms for fully automatic verification. This competition report describes the (updated) rules and definitions, presents the competition results, and discusses some interesting facts about the execution of the competition experiments. The objectives of the competitions were discussed earlier (1-4 [16]) and extended over the years (5-6 [17]):


This report extends previous reports on SV-COMP [10, 11, 12, 13, 14, 15, 16, 17].

Reproduction packages are available on Zenodo (see Table 4).

Funded in part by the Deutsche Forschungsgemeinschaft (DFG) – 378803395 (ConVeY). dirk.beyer@sosy-lab.org

© The Author(s) 2021

J. F. Groote and K. G. Larsen (Eds.): TACAS 2021, LNCS 12652, pp. 401–422, 2021.

https://doi.org/10.1007/978-3-030-72013-1\_24


The previous report [17] discusses the outcome of the SV-COMP competition so far with respect to these objectives.

Related Competitions. Competitions are an important evaluation method and there are many competitions in the field of formal methods. We refer to the previous report [17] for a more detailed discussion and give here only the references to the most related competitions [9, 19, 55, 56].

Quick Summary of Changes. We strive to continuously improve the competition, and this report describes the changes of the last year. In the following we list a brief summary of new items in SV-COMP 2021:


### 2 Organization, Definitions, Formats, and Rules

Procedure. The overall organization of the competition did not change in comparison to the earlier editions [10, 11, 12, 13, 14, 15, 16, 17]. SV-COMP is an open competition (also known as comparative evaluation), where all verification tasks are known before the submission of the participating verifiers, which is necessary due to the complexity of the C language. The procedure is partitioned into the *benchmark submission* phase, the *training* phase, and the *evaluation* phase. The participants received the results of their verifier continuously via e-mail (for pre-runs and the final competition run), and the results were publicly announced on the competition web site after the teams inspected them. The *Competition Jury* oversees the process and consists of the competition chair and one member of each participating team. Team representatives of the jury are listed in Table 5.


Table 1: Tools for witness-based result validation (validators) and witness linter

License Requirements. Starting in 2018, SV-COMP has required that the verifier be publicly available for download and have a license that

(i) allows reproduction and evaluation by anybody (incl. results publication),

(ii) does not restrict the usage of the verifier output (log files, witnesses), and

(iii) allows any kind of (re-)distribution of the unmodified verifier archive.

During the qualification phase, when the jury members inspected the verifier archives, several issues with licenses (missing licenses, incompatibilities) were detected, which the developers were able to address in time.

With SV-COMP 2021, the community started the process of making the benchmark collection REUSE compliant (https://reuse.software) by adding SPDX license identifiers (https://spdx.dev). A few directories are properly labeled already, and continuous-integration checks with REUSE ensure that new contributions adhere to the standard.
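To illustrate what such labeling looks like, the following sketch extracts the two per-file tags used by REUSE and SPDX (`SPDX-License-Identifier` and `SPDX-FileCopyrightText`) from the header of a source file. The function name `spdx_header_info`, the 20-line header window, and the example file content are assumptions for illustration; the sketch is not the check that runs in the benchmark repository's continuous integration.

```python
import re

# Tag names as used by the REUSE and SPDX conventions for per-file information.
LICENSE_TAG = re.compile(r"SPDX-License-Identifier:[ \t]*(\S.*)")
COPYRIGHT_TAG = re.compile(r"SPDX-FileCopyrightText:[ \t]*(\S.*)")


def spdx_header_info(source_text, max_lines=20):
    """Look for SPDX tags in the first `max_lines` lines of a file."""
    head = "\n".join(source_text.splitlines()[:max_lines])
    license_match = LICENSE_TAG.search(head)
    copyright_match = COPYRIGHT_TAG.search(head)
    return {
        "license": license_match.group(1).strip() if license_match else None,
        "copyright": copyright_match.group(1).strip() if copyright_match else None,
    }


# Hypothetical verification task with a labeled header.
EXAMPLE_TASK = """\
// SPDX-FileCopyrightText: 2021 Example Author (hypothetical)
// SPDX-License-Identifier: Apache-2.0

int main(void) { return 0; }
"""

info = spdx_header_info(EXAMPLE_TASK)
```

A file for which either entry is `None` would still need to be labeled.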

Validation of Results. This time, the validation of the verification results was done by seven validation tools, which are listed in Table 1, including references to literature. The validators CPAchecker and UAutomizer have supported the competition since the beginning of its result validation in 2015. Execution-based validation was added in 2018 using CPA-w2t and FShell-w2t. Two new validators have participated since the previous SV-COMP in 2020: Nitwit and MetaVal. A few categories were still excluded from validation because no validators were available for some types of programs or properties.

For SV-COMP 2021, the new validator WitnessLint was added for validating witnesses regarding their syntax. It checks the witnesses produced by the verification tools against the specification of the format for verification witnesses (https://github.com/sosy-lab/sv-witnesses/tree/svcomp21). For example, WitnessLint ensures that a verification witness is a proper XML/GraphML file and contains the required meta data. This means that the validators can focus on the validation of the verification result, assuming that the verification witness is syntactically valid. If the witness linter deems a verification witness as syntactically invalid, then the answers of the result validators are ignored and the result is not counted as confirmed.
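The kind of syntactic check described above can be sketched in a few lines. This is not WitnessLint itself: the function `lint_witness` is hypothetical, and the list of required metadata keys is an assumption chosen for illustration; the authoritative list is given by the witness-format specification referenced above.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"

# Assumed subset of graph-level metadata keys; see the format specification.
REQUIRED_KEYS = ("witness-type", "producer", "specification",
                 "programfile", "programhash")


def lint_witness(text):
    """Return a list of problems; an empty list means the witness passes."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError as err:
        return [f"not well-formed XML: {err}"]
    if root.tag != f"{{{GRAPHML_NS}}}graphml":
        return ["root element is not <graphml> in the GraphML namespace"]
    graph = root.find(f"{{{GRAPHML_NS}}}graph")
    if graph is None:
        return ["missing <graph> element"]
    # Only direct children of <graph> carry the witness-level metadata.
    present = {d.get("key") for d in graph.findall(f"{{{GRAPHML_NS}}}data")}
    return [f"missing metadata: {key}" for key in REQUIRED_KEYS if key not in present]


# Hypothetical minimal witness used to exercise the check.
EXAMPLE_WITNESS = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph edgedefault="directed">
    <data key="witness-type">violation_witness</data>
    <data key="producer">ExampleVerifier</data>
    <data key="specification">CHECK( init(main()), LTL(G ! call(reach_error())) )</data>
    <data key="programfile">example.c</data>
    <data key="programhash">0000</data>
    <node id="N0"><data key="entry">true</data></node>
  </graph>
</graphml>"""
```

A witness for which `lint_witness` returns a non-empty list would correspond to the case in which the answers of the result validators are ignored.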

Task-Definition Format 2.0. The format for the task definitions in the SV-Benchmarks repository was recently extended to include a set of options that can carry information from the verification task to the verification tool. SV-COMP 2021 used the task-definition format in version 2.0 (https://gitlab.com/sosy-lab/benchmarking/task-definition-format/-/tree/2.0). More details can be found in the report for Test-Comp 2021 [19].

Properties. Please see the 2015 competition report [13] for the definition of the properties and the property format. All specifications are available in the directory c/properties/ of the benchmark repository.

Categories. The updated category structure is illustrated by Fig. 1. The categories are also listed in Tables 7 and 8, and described in detail on the competition web site (https://sv-comp.sosy-lab.org/2021/benchmarks.php). Compared to the category structure for SV-COMP 2020, we added the sub-categories *XCSP* and *Combinations* to category *ReachSafety*, and the sub-categories *DeviceDriversLinux64Large ReachSafety*, *uthash MemSafety*, *uthash NoOverflows*, and *uthash ReachSafety* to category *SoftwareSystems*.

Another effort was to integrate some of the Juliet benchmark tasks [31] into the SV-Benchmarks collection. We requested a license for the Juliet programs that properly clarifies the license terms also outside the USA. We thank our colleagues from NIST for releasing their Juliet benchmark (which is declared as public domain) under the Creative Commons license CC0-1.0 (https://github.com/sosy-lab/sv-benchmarks/blob/svcomp21/LICENSES/CC0-1.0.txt). SV-COMP 2021 used many verification tasks from Juliet, in particular for the memory-safety properties CWE121 (stack-based buffer overflow), CWE401 (memory leak), CWE415 (double free), CWE476 (null-pointer dereference), and CWE590 (free memory that is not on the heap) (see https://github.com/sosy-lab/sv-benchmarks/blob/svcomp21/c/MemSafety-Juliet.set).

All these new contributions to the benchmark collection led to a growth in the number of verification tasks from 11 052 in SV-COMP 2020 to 15 201 in SV-COMP 2021.

Verification Tasks. The previous verification tasks and competition rules used special definitions for the functions \_\_VERIFIER\_error and \_\_VERIFIER\_assume. These special definitions were found to be unintuitive and inconsistent with expectations in the verification community, and repeatedly caused confusion among participants. A call of function \_\_VERIFIER\_error() was defined to never return. A call of function \_\_VERIFIER\_assume(p) was defined such that if expression p evaluates to false, then the function loops forever, otherwise the function returns without any side effects. This led to unintended interactions with other properties.

We eliminated these two functions in two steps. In the first step, each function call was replaced by a C-code implementation of the intended behavior. In most of the cases, \_\_VERIFIER\_error(); was replaced by the C code reach\_error(); abort();, where reach\_error is a 'normal' function, i.e., one whose interpretation follows the C standard [3].

Eliminating \_\_VERIFIER\_assume was more complicated: In some tasks for property *memory-cleanup*, \_\_VERIFIER\_assume(p); was replaced by the C code assume\_cycle\_if\_not(p);, which is implemented as if (!p) while(1);, while for other tasks, \_\_VERIFIER\_assume(p); was replaced by assume\_abort\_if\_not(p);, which is implemented as if (!p) abort();. The solution nicely illustrates the problem of the special semantics: Consider property *memory-cleanup*, which requires that all allocated memory is deallocated before the program terminates. Here, the desired behavior of a failing assume statement would be that the program does not terminate (and does not unintendedly violate the *memory-cleanup* property). Now consider property *termination*, which requires that every path finally reaches the end of the program. Here, the desired behavior of a failing assume statement would be that the program terminates (and does not unintendedly violate the *termination* property).

Fig. 1: Category structure for SV-COMP 2021; category *C-FalsificationOverall* contains all verification tasks of *C-Overall* without *Termination*; *Java-Overall* contains all Java verification tasks; compared to SV-COMP 2020, there are two new sub-categories in *ReachSafety* and four new sub-categories in *SoftwareSystems*

Table 2: Scoring schema for SV-COMP 2021 (new: no point for unconfirmed correct results anymore)

Fig. 2: Visualization of the scoring schema for the reachability property (adjusted from a previous report [15])

In the second step, the specifications for functions \_\_VERIFIER\_error and \_\_VERIFIER\_assume were removed from the competition rules (because no such functions exist anymore in the SV-Benchmarks collection).
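The first step can be pictured as a purely textual rewrite of the task sources. The sketch below is only an illustration under simplifying assumptions: the function `rewrite_task` and its `memory_cleanup` flag are hypothetical, the regular expressions handle only straightforward call sites, and the `int` parameter type of the helper functions is assumed. It is not the script that was used on the SV-Benchmarks collection.

```python
import re

# C helper definitions with the bodies described in the text
# (the parameter type `int` is an assumption).
ASSUME_HELPERS = """\
void assume_abort_if_not(int cond) { if (!cond) abort(); }
void assume_cycle_if_not(int cond) { if (!cond) while (1); }
"""


def rewrite_task(c_source, memory_cleanup=False):
    """Replace the two special functions by calls to ordinary C functions."""
    # For memory-cleanup tasks a failing assumption must not terminate.
    assume = "assume_cycle_if_not" if memory_cleanup else "assume_abort_if_not"
    c_source = re.sub(r"\b__VERIFIER_error\s*\(\s*\)\s*;",
                      "reach_error(); abort();", c_source)
    c_source = re.sub(r"\b__VERIFIER_assume\s*\(", assume + "(", c_source)
    return c_source


old = "if (x < 0) { __VERIFIER_error(); }\n__VERIFIER_assume(n > 0);\n"
new = rewrite_task(old)
```

Here `new` contains `reach_error(); abort();` and `assume_abort_if_not(n > 0);` in place of the two special calls.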

Scoring Schema and Ranking. Table 2 provides an overview and Fig. 2 visually illustrates the score assignment for the reachability property as an example. The scoring schema was changed regarding the special rule for unconfirmed correct results for expected result True. There was a rule during the transitioning phase to assign one point if the answer matches the expected result but the witness was not confirmed. Now score points are only assigned if the results got validated (or no validator was available).
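The score assignment can be written down as a small function. Since Table 2 is not reproduced in this text, the point values used below (+2 for a confirmed correct True, +1 for a confirmed correct False, −16 for a false alarm, −32 for a wrong proof, 0 otherwise) are stated here as assumptions and should be checked against the table; the function name `score` and its parameters are hypothetical.

```python
def score(answer, expected, witness_confirmed=True, validator_available=True):
    """Score of a single verification run (point values assumed, see lead-in)."""
    if answer not in ("true", "false"):
        return 0  # unknown, timeout, out of memory, crash
    if answer != expected:
        # Wrong proof (incorrect True) is penalized more than a false alarm.
        return -32 if answer == "true" else -16
    if validator_available and not witness_confirmed:
        return 0  # new in SV-COMP 2021: unconfirmed correct results get no point
    return 2 if answer == "true" else 1
```

For example, `score("true", "true", witness_confirmed=False)` yields 0 under these assumptions, whereas the same answer in a category without any validator still yields 2.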

As in previous years, the rank of a verifier was decided based on the sum of points (normalized for meta categories). In case of a tie, the rank was decided based on success run time, which is the total CPU time over all verification tasks for which the verifier reported a correct verification result. *Opt-out from Categories* and *Score Normalization for Meta Categories* were handled as described previously [11] (page 597).
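The ranking rule amounts to a sort with a two-part key. The following sketch uses hypothetical verifier names and numbers; the class `Result` and the function `rank` are illustrative names only.

```python
from dataclasses import dataclass


@dataclass
class Result:
    verifier: str
    score: float               # sum of points (already normalized if needed)
    success_cpu_time_s: float  # CPU time over correctly solved tasks


def rank(results):
    """Higher score first; ties are broken by lower success run time."""
    return sorted(results, key=lambda r: (-r.score, r.success_cpu_time_s))


# Hypothetical category results.
results = [
    Result("VerifierA", 120, 5400.0),
    Result("VerifierB", 150, 9000.0),
    Result("VerifierC", 120, 3100.0),
]
ranking = [r.verifier for r in rank(results)]
```

In this example VerifierC is placed ahead of VerifierA because both have the same score and VerifierC needed less CPU time for its correct results.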

#### 3 Reproducibility

To allow independent reproduction of the SV-COMP results, we made all major components that were used in the competition available in public version-control repositories. An overview of the components that contribute to the reproducible setup of SV-COMP is provided in Fig. 3, and the details are given in Table 3. We refer to the SV-COMP 2016 report [14] for a description of all components of the SV-COMP organization.

We have published the competition artifacts at Zenodo (see Table 4) to guarantee their long-term availability and immutability. These artifacts comprise the verification tasks, the competition results, the produced verification witnesses, and the BenchExec package. The archive for the competition results includes the raw results in BenchExec's XML exchange format, the log output of the verifiers and validators, and a mapping from file names to SHA-256 hashes. The hashes of the files are useful for validating the exact contents of a file, and accessing the files inside the archive that contains the verification witnesses.
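Checking file contents against such a mapping is straightforward with a standard SHA-256 implementation. The sketch below works on in-memory data; the file names, the dictionary layout, and the function names are assumptions for illustration and do not reflect the actual layout of the published archives.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def mismatching_files(files, mapping):
    """Names of files whose content does not match the recorded hash."""
    return sorted(name for name, data in files.items()
                  if mapping.get(name) != sha256_hex(data))


# Hypothetical witness files and the mapping recorded for them.
files = {"witness-1.graphml": b"<graphml/>", "witness-2.graphml": b"abc"}
mapping = {name: sha256_hex(data) for name, data in files.items()}

# Simulate a file that was modified after the mapping was recorded.
tampered = dict(files, **{"witness-2.graphml": b"abd"})
```

With unmodified files `mismatching_files(files, mapping)` is empty, while the modified copy is reported by name.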

Competition Workflow. The workflow of the competition is described in the report for Test-Comp 2021 [19].

CoVeriTeam. The competition was for the first time supported by CoVeriTeam [26] (https://gitlab.com/sosy-lab/software/coveriteam/), which is a tool for cooperative verification. Among its many capabilities, it enables remote execution of verification runs directly on the competition machines, which was found to be a valuable service for troubleshooting.

#### 4 Results and Discussion

The results of the competition experiments represent the state of the art in fully automatic software-verification tools. The report shows the results in terms of effectiveness (number of verification tasks that can be solved and correctness of the results, as accumulated in the score) and efficiency (resource consumption in terms of CPU time and CPU energy). The results are presented in the same way as in previous years, such that the improvements compared to last year are easy to identify. The results presented in this report were inspected and approved by the participating teams. We now discuss the highlights of the results.

Fig. 3: Benchmarking components of SV-COMP and competition's execution flow (same as for SV-COMP 2020)

Table 3: Publicly available components for reproducing SV-COMP 2021

Participating Verifiers. Table 5 provides an overview of the participating verification systems (see also the listing on the competition web site at https://sv-comp.sosy-lab.org/2021/systems.php). Table 6 lists the algorithms and techniques that are used by the verification tools.

Automatic Participation. To ensure that the comparative evaluation continues to give an overview of the state of the art that is as broad as possible, a rule was introduced before SV-COMP 2020 which enables the option for the organizer to reuse systems that participated in previous years for the comparative evaluation. This option was used three times in SV-COMP 2021: for Coastal, PredatorHP, and SPF. Those participations are marked as 'hors concours' in Table 5.


Table 5: Competition candidates with tool references and representing jury members

Computing Resources. The resource limits were the same as in the previous competitions [14]: Each verification run was limited to 8 processing units (cores), 15 GB of memory, and 15 min of CPU time. Witness validation was limited to 2 processing units, 7 GB of memory, and 1.5 min of CPU time for violation witnesses and 15 min of CPU time for correctness witnesses. The machines for running the experiments are part of a compute cluster that consists of 168 machines; each verification run was executed on an otherwise completely unloaded, dedicated machine, in order to achieve precise measurements. Each machine had one Intel Xeon E3-1230 v5 CPU with 8 processing units, a frequency of 3.4 GHz, 33 GB of RAM, and a GNU/Linux operating system (x86\_64-linux, Ubuntu 20.04 with Linux kernel 5.4). We used BenchExec [28] to measure and control computing resources (CPU time, memory, CPU energy) and VerifierCloud (https://vcloud.sosy-lab.org) to distribute, install, run, and clean-up verification runs, and to collect the results. The values for time and energy are accumulated over all cores of the CPU. To measure the CPU energy, we used CPU Energy Meter [30] (integrated in BenchExec [28]).

Table 6: Algorithms and techniques that the competition candidates used

One complete verification execution of the competition consisted of 163 177 verification runs (each verifier on each verification task of the selected categories according to the opt-outs), consuming 470 days of CPU time and 126 kWh of CPU energy (without validation). Witness-based result validation required 961 919 validation runs (each validator on each verification task for categories with witness validation, and for each verifier), consuming 274 days of CPU time. Each tool was executed several times, in order to make sure no installation issues occur during the execution. Including preruns, the infrastructure managed a total of 1.33 million verification runs consuming 4.16 years of CPU time, and 7.31 million validation runs consuming 3.84 years of CPU time.
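To put these totals into perspective, the averages per run can be derived directly from the numbers above. The following back-of-the-envelope calculation uses only the figures stated in this paragraph; since those are rounded, the derived values are approximate.

```python
SECONDS_PER_DAY = 24 * 60 * 60
JOULES_PER_KWH = 3.6e6

# Figures from the text (one complete execution, without preruns).
verification_runs = 163_177
verification_cpu_s = 470 * SECONDS_PER_DAY
verification_energy_j = 126 * JOULES_PER_KWH
validation_runs = 961_919
validation_cpu_s = 274 * SECONDS_PER_DAY

avg_verification_s = verification_cpu_s / verification_runs  # about 249 s
avg_validation_s = validation_cpu_s / validation_runs        # about 25 s
avg_energy_j = verification_energy_j / verification_runs     # about 2 780 J
avg_power_w = verification_energy_j / verification_cpu_s     # about 11 W
```

On average, a verification run thus consumed roughly four minutes of CPU time, well below the limit of 15 min, and a validation run less than half a minute.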

Quantitative Results. Table 7 presents the quantitative overview of all tools and all categories. The head row mentions the category, the maximal score for the category, and the number of verification tasks. The tools are listed in alphabetical order; every table row lists the scores of one verifier. We indicate the top three candidates by formatting their scores in bold face and in larger font size. An empty table cell means that the verifier opted out of the respective main category (perhaps participating in subcategories only, restricting the evaluation to a specific topic). More information (including interactive tables, quantile plots for every category, and also the raw data in XML format) is available on the competition web site (https://sv-comp.sosy-lab.org/2021/results) and in the results artifact (see Table 4).

Table 8 reports the top three verifiers for each category. The run time (column 'CPU Time') and energy (column 'CPU Energy') refer to successfully solved verification tasks (column 'Solved Tasks'). We also report the number of tasks for which no witness validator was able to confirm the result (column 'Unconf. Tasks'). The columns 'False Alarms' and 'Wrong Proofs' report the number of verification tasks for which the verifier reported wrong results, i.e., reporting a counterexample when the property holds (incorrect False) and claiming that the program fulfills the property although it actually contains a bug (incorrect True), respectively.

Score-Based Quantile Functions for Quality Assessment. We use score-based quantile functions [11, 28] because these visualizations make it easier to understand the results of the comparative evaluation. The web site (https://sv-comp.sosy-lab.org/2021/results) and the results archive (see Table 4) include such a plot for each (sub-)category. As an example, we show the plot for category *C-Overall* (all verification tasks) in Fig. 4. A total of 10 verifiers participated in category *C-Overall*, for which the quantile plot shows the overall performance over all categories (scores for meta categories are normalized [11]). A more detailed discussion of score-based quantile plots, including examples of what insights one can obtain from the plots, is provided in previous competition reports [11, 14].
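The data points of such a plot can be computed from the per-task results of one verifier. The sketch below assumes the construction described in earlier reports as we understand it: correct results are sorted by CPU time, their scores are accumulated along the x-axis, and the curve is shifted to the left by the sum of the negative scores of wrong results. The function `quantile_points` and the sample data are hypothetical, and normalization for meta categories is omitted.

```python
def quantile_points(results):
    """Map (score, cpu_time_s) pairs of one verifier to plot points (x, y).

    x is the accumulated score, y the CPU time of the slowest run needed
    to reach that score. Wrong results (negative score) shift the start.
    """
    offset = sum(score for score, _ in results if score < 0)
    correct = sorted(((s, t) for s, t in results if s > 0), key=lambda r: r[1])
    points = []
    accumulated = offset
    for score, cpu_time in correct:
        accumulated += score
        points.append((accumulated, cpu_time))
    return points


# Hypothetical results: three correct runs, one false alarm, one timeout.
sample = [(2, 10.0), (1, 3.0), (-16, 5.0), (2, 100.0), (0, 900.0)]
points = quantile_points(sample)
```

In such a plot, a curve that reaches further to the right indicates a higher total score, and a curve that stays lower indicates faster correct results.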

Alternative Rankings. The community suggested reporting a couple of alternative rankings that honor different aspects of the verification process as a complement to the official SV-COMP ranking. Table 9 is similar to Table 8, but contains the alternative ranking categories *Correct* and *Green Verifiers*. Column 'Quality' gives the score in score points, column 'CPU Time' the CPU usage of successful runs in hours, column 'CPU Energy' the CPU energy of successful runs in kWh, column 'Solved Tasks' the number of correct results, column 'Wrong Results' the sum of false alarms and wrong proofs in number of errors, and column 'Rank Measure' gives the measure to determine the alternative rank.

Table 7: Quantitative overview over all results; empty cells represent opt-outs; an asterisk after the tool name marks hors-concours participation

Table 8: Overview of the top-three verifiers for each category (measurement values for CPU time and energy rounded to two significant digits)

Fig. 4: Quantile functions for category *C-Overall*. Each quantile function illustrates the quantile (x-coordinate) of the scores obtained by correct verification runs below a certain run time (y-coordinate). More details were given previously [11]. A logarithmic scale is used for the time range from 1 s to 1000 s, and a linear scale is used for the time range between 0 s and 1 s.

Table 9: Alternative rankings for category *Overall*; quality is given in score points (sp), CPU time in hours (h), CPU energy in kilowatt-hours (kWh), wrong results in errors (E), rank measures in errors per score point (E/sp), joule per score point (J/sp), and score points (sp)

*Correct Verifiers — Low Failure Rate.* The right-most columns of Table 8 report that the verifiers achieve a high degree of correctness (all top three verifiers in category *C-Overall* have fewer than 2‰ wrong results). The winners of category *Java-Overall* produced not a single wrong answer. The first category in Table 9 uses a failure rate as rank measure: $\frac{\text{number of incorrect results}}{\text{total score}}$, the number of errors per score point (E/sp). We use E as unit for number of incorrect results and sp as unit for total score. The worst result was 0.032 E/sp in SV-COMP 2020 and is now improved to 0.023 E/sp.

Table 10: New verifiers in SV-COMP 2020 and SV-COMP 2021

Table 11: Confirmation rate of verification witnesses in SV-COMP 2021

*Green Verifiers — Low Energy Consumption.* Since a large part of the cost of verification is given by the energy consumption, it might be important to also consider the energy efficiency. The second category in Table 9 uses the energy consumption per score point as rank measure: $\frac{\text{total CPU energy}}{\text{total score}}$, with the unit J/sp. The worst result from SV-COMP 2020 was 2 200 J/sp, now improved to 630 J/sp.
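Both rank measures are simple ratios; the only subtlety is the unit conversion from kWh (as reported in the tables) to joule. The sketch below uses hypothetical input values, not results from the competition, and the function names are illustrative.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 MJ


def failure_rate(wrong_results, total_score):
    """Rank measure of the 'Correct Verifiers' ranking, in E/sp."""
    return wrong_results / total_score


def energy_per_score_point(cpu_energy_kwh, total_score):
    """Rank measure of the 'Green Verifiers' ranking, in J/sp."""
    return cpu_energy_kwh * JOULES_PER_KWH / total_score


# Hypothetical verifier: 10 000 score points, 4 wrong results, 1.5 kWh.
example_failure_rate = failure_rate(4, 10_000)               # 0.0004 E/sp
example_energy_rate = energy_per_score_point(1.5, 10_000)    # 540 J/sp
```

In both rankings a smaller value of the rank measure is better.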

*New Verifiers.* To acknowledge the verification systems that participate for the first or second time in SV-COMP, Table 10 lists the new verifiers (in SV-COMP 2020 or SV-COMP 2021).

Verifiable Witnesses. Results validation is of primary importance in the competition. All SV-COMP verifiers are required to justify the result (True or False) by producing a verification witness (except for those categories for which no witness validator is available). We used six independently developed witness-based result validators and one witness linter (see Table 1).

Fig. 5: Number of evaluated verifiers for each year (first-time participants on top)

Table 11 shows the confirmed versus unconfirmed results: the first column lists the verifiers of category *C-Overall*, the three columns for result True report the total, confirmed, and unconfirmed number of verification tasks for which the verifier answered with True, respectively, and the three columns for result False report the total, confirmed, and unconfirmed number of verification tasks for which the verifier answered with False, respectively. More information (for all verifiers) is given in the detailed tables on the competition web site and in the results artifact; all verification witnesses are also contained in the witnesses artifact (see Table 4). The verifiers 2ls and UKojak are the winners in terms of confirmed results for expected results True and False, respectively. The overall interpretation is similar to SV-COMP 2020 [17].

#### 5 Conclusion

The 10th edition of the Competition on Software Verification (SV-COMP 2021) had 30 participating verification systems from 11 countries (see Fig. 5 for the participation numbers and Table 5 for the details). The competition does not only execute the verifiers and collect results, but also validates the verification results using verification witnesses. We used six independent validators to check the results and a witness linter to check if the verification witnesses are syntactically valid (Table 1). The number of verification tasks was increased to 15 201 in the C category and to 473 in the Java category. The high quality standards of the TACAS conference, in particular with respect to the important principles of fairness, community support, and transparency are ensured by a competition jury in which each participating team had a member. The results of our comparative evaluation provide a broad overview of the state of the art in automatic software verification. SV-COMP is instrumental in developing more reliable tools, as well as identifying and propagating successful techniques for software verification.

Data Availability Statement. The verification tasks and results of the competition are published at Zenodo, as described in Table 4. All components and data that are necessary for reproducing the competition are available in public version repositories, as specified in Table 3. Furthermore, the results are presented online on the competition web site for easy access: https://sv-comp.sosy-lab.org/2021/results/.

#### References




#### **CPALockator: Thread-Modular Analysis with Projections (Competition Contribution)**

Pavel Andrianov<sup>1</sup>, Vadim Mutilin<sup>1,3</sup>, and Alexey Khoroshilov<sup>1,2,3,4</sup>

<sup>1</sup> Ivannikov Institute for System Programming of RAS, Moscow, Russia <sup>2</sup> Lomonosov Moscow State University, Moscow, Russia <sup>3</sup> Moscow Institute of Physics and Technology, Moscow, Russia <sup>4</sup> Higher School of Economics, Moscow, Russia

**Abstract.** Our submission to SV-COMP'21 is based on the software verification framework CPAchecker and implements an extension of the thread-modular approach. It considers every thread separately, but in a special environment which models thread interactions. The environment is expressed by projections of normal transitions in each thread. A projection contains a description of possible effects over shared data and synchronization primitives, as well as conditions of its application. By adjusting the precision of the projections, one can find a balance between the speed and the precision of the whole analysis.

Implementation on the top of the CPAchecker framework allows combining our approach with existing algorithms and analyses. Evaluation on the sv-benchmarks confirms the scalability and soundness of the approach.

**Keywords:** Multithreading · Projection · Thread-modular approach

#### **1 Verification Approach**

The main challenge in verifying industrial multithreaded software is to account for potential thread interactions efficiently. Our verification approach is based on the thread-modular technique [4,5]. The approach avoids a Cartesian product of thread states by considering each thread state separately. Thus, an abstract state is no longer complete: it is a partial abstract state that represents only one thread. As a consequence, the analysis has no information about transitions in other threads, which is required for soundness. To preserve soundness, we have to take into account the influence of other threads on the considered thread. For that purpose, we compute a special representation of the environment, which consists of a set of thread transitions, so-called projected transitions, or projections. The projections may be more or less precise, which strongly affects the precision and speed of the whole analysis. Note that the projections are independent, and thus the correct sequence of transitions is lost. Potentially, any projection may affect another thread at any time. This is an overapproximation, which leads to an imprecise analysis.

<sup>-</sup>Representing jury member, corresponding author: andrianov@ispras.ru

© The Author(s) 2021

J. F. Groote and K. G. Larsen (Eds.): TACAS 2021, LNCS 12652, pp. 423–427, 2021. https://doi.org/10.1007/978-3-030-72013-1\_25

Let us explain how we increase precision by considering only compatible projections.

**Fig. 1.** Computation of a thread environment and its application

Figure 1 shows one step of the analysis. After computing an abstract state in the first thread, we have to propagate its effect (x is a shared variable) to the other threads. To do so, we compute a projection of the operation. The projection becomes part of the environment and affects the other threads through it. We then apply the new effect to the other threads.

In the example, we lose the precision of the effect by abstracting from the assigned value (x = ∗). One of the key ideas of the proposed approach is to extend abstraction from states to operations, i.e., transitions. Thus, in other configurations the projection may look like x = 1 or ∗ = ∗. This allows adjusting the level of abstraction of the environment to a specific task. By adjusting the configuration, it is possible not only to vary the abstraction level but also to construct an algorithm that is closer either to data-flow analysis or to software model checking.

To construct a precise analysis, we propose to encode not only abstract operations but also the conditions of their application, so-called guards. The guards are related to the predecessor abstract state, but they are not required to be equal to it. The guards store information about variable values, locks, threads, or even abstract predicates. In Fig. 1, the guard contains information about the initial value of the modified variable x (x == 0). A projection may be applied to a particular state only if the guards allow it. In that case, we say that the projection is compatible with the abstract state of the other thread. In our example, the effect x = ∗ may be applied to the other thread only if the corresponding state does not contradict the condition x == 0.
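The compatibility check can be illustrated with a minimal Python sketch. It is not CPALockator's implementation: the `Projection` class, the dictionary-based abstract states, and the helper names are assumptions made for illustration, and `None` stands for the abstracted value ∗.

```python
from dataclasses import dataclass

# Hypothetical model: an abstract state maps variables to known values;
# a variable that is absent from the mapping has an unknown value.

@dataclass
class Projection:
    guard: dict   # conditions on the predecessor state, e.g. {"x": 0}
    effect: dict  # assigned values; None models the abstracted value "*"

def compatible(state, projection):
    """A projection is compatible if the state does not contradict its guard."""
    return all(
        var not in state or state[var] == value
        for var, value in projection.guard.items()
    )

def apply_projection(state, projection):
    """Return the successor state, or None if the guard rules the projection out."""
    if not compatible(state, projection):
        return None
    successor = dict(state)
    for var, value in projection.effect.items():
        if value is None:
            successor.pop(var, None)  # x = *: forget what was known about x
        else:
            successor[var] = value
    return successor

# The projection from Fig. 1: guard x == 0, effect x = *
fig1 = Projection(guard={"x": 0}, effect={"x": None})
```

In this sketch, a state with x == 1 contradicts the guard, so the effect is not applied, while a state that knows nothing about x is compatible with the projection.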

More information about the approach and theoretical preliminaries can be found in [1]. Practical application of the theory to the Linux kernel drivers can be found in [2].

#### **2 Software Architecture**

CPALockator is based on the CPAchecker framework and has the same software architecture. Its key concept is the configurable program analysis (CPA) [3]. Each abstract domain is implemented in its own CPA. CPAs in the framework, e.g., value analysis or predicate analysis, can be combined to build a more efficient and more precise approach. A configurable algorithm, CEGAR in the case of CPALockator, uses CPAs to construct a set of reachable states. Figure 2 presents the current configuration. The highlighted components are implemented and used only in CPALockator. Lock analysis tracks acquired locks. It helps to compute the thread effects that can be applied to a particular thread. Thread analysis determines whether two code blocks may be executed in parallel. Predicate analysis is extended to handle environment actions, which allows constructing a predicate abstraction in the thread-modular case. More information about CPALockator can be found in [1,2].

**Fig. 2.** Different CPAs in CPALockator configuration

#### **3 Strengths and Weaknesses**

First, we emphasize that the tool targets, and is used in practice for, finding bugs in large industrial software systems, for example, operating system kernels. We applied the tool to the Linux kernel and to a number of private real-time OS kernels. The main challenge there is scalability. Results on small but tricky sv-benchmarks look poor because of the trade-off between scalability and precision. Our tool is not as precise as those of other participants, but it demonstrates its scalability on a small set of complicated sv-benchmarks. In any case, such a comparison is useful for the community.

The thread-modular approach cannot solve tasks that contain control dependencies in the environment, as we consider all projections independently of each other and thus lose their order. This is also a problem for witness validation, as the tool provides a path only in a single thread. It is a limitation of the approach, not only of the tool itself. In practice, we use a format that is more user-friendly than witnesses to analyze, visualize, and evaluate error traces [6]. However, the approach simplifies thread interaction, and the benefit is considerable for large, complicated tasks that cannot be analyzed with precise model checkers.

The approach shows its benefit on complicated tasks, such as those in the ldv-linux-3.14 races directory. CPALockator correctly solves 4 of those 7 benchmarks and obtains an imprecise counterexample for one more. The remaining two tasks can be solved by another, faster CPALockator configuration. The other tools mostly have problems with these benchmarks due to their complexity and size. The explanation is rather evident: most of the tools try to consider the precise interaction between threads, while CPALockator abstracts from it and considers each thread separately. Note that the benchmarks give verifiers a strong hint: there is only one assert to check, while in the real world nobody knows where the bug may be located.

The overall results are not so good because of problems related both to the approach itself and to its implementation. The majority of unknowns are related to unsupported atomic operations, such as atomic functions, compare-and-swap, and so on. Currently, our tool supports only synchronization operations based on locks, as industrial software mostly contains them. Another problem is related to predicate analysis and interpolation. The current implementation of the interpolation procedure cannot produce interpolants for other threads, which limits the power of predicate analysis. Other problems are also present, but they are less significant.

Nevertheless, CPALockator does not produce incorrect **true** verdicts, which confirms the soundness of the approach. All produced **true** verdicts are confirmed by validators; however, there are not many of them, as we skip all tasks with unsupported functions. Thus, the presented approach may be used in combination with more precise techniques.

#### **4 Tool Setup and Configuration**

We submitted CPALockator<sup>5</sup> built from svn revision 36155 for participation in the category Concurrency. The tool requires a Java 11 runtime environment. CPAchecker has to be executed with the following command line:

scripts/cpa.sh -svcomp21-lockator -spec reach.prp program.i

or via the BenchExec tool.

#### **5 Project and Contributors**

The CPAchecker project is mainly developed by an international research group from the Ludwig-Maximilian University of Munich. CPALockator is based on CPAchecker and is developed and supported by researchers from Ivannikov Institute for System Programming of the Russian Academy of Sciences. We thank Dirk Beyer and the CPAchecker team for their work and fruitful discussions.

#### **References**


<sup>5</sup> https://doi.org/10.5281/zenodo.4486117


### Dartagnan: Leveraging Compiler Optimizations and the Price of Precision (Competition Contribution)

Hernán Ponce-de-León <sup>1</sup>-, Thomas Haas <sup>2</sup>, and Roland Meyer <sup>2</sup>

<sup>1</sup>Bundeswehr University Munich, Munich, Germany <sup>2</sup>TU Braunschweig, Braunschweig, Germany hernan.ponce@unibw.de, t.haas@tu-braunschweig.de, roland.meyer@tu-bs.de

Abstract. We describe the new features of the bounded model checker Dartagnan for SV-COMP'21. We participate, for the first time, in the *ReachSafety* category on the verification of sequential programs. In some of these verification tasks, bugs only show up after many loop iterations, which is a challenge for bounded model checking. We address the challenge by simplifying the structure of the input program while preserving its semantics. For simplification, we leverage common compiler optimizations, which we get for free by using LLVM. Yet, there is a price to pay. Compiler optimizations may introduce bitwise operations, which require bit-precise reasoning. We evaluated an SMT encoding based on the theory of integers + bit conversions against one based on the theory of bit-vectors and found that the latter yields better performance. Compared to the unoptimized version of Dartagnan, the combination of compiler optimizations and bit-vectors yields a speed-up of an order of magnitude on average.

#### 1 Overview

Dartagnan is a bounded model checking (BMC) tool for reachability analysis. It takes a program and converts it to an SMT formula representing all its executions up to a given bound. This formula, together with a reachability condition representing assertions, is passed to an SMT solver (we use Z3 as a backend). If the formula is satisfiable, an execution violating an assertion exists.

Dartagnan was initially developed to verify small concurrent programs (written in the .litmus format) under weak memory models. Since 2020, it also supports the Boogie *intermediate verification language* as an input language. For C programs, we use SMACK [8] to compile to LLVM and transform the compiled code to Boogie. Dartagnan's architecture and main verification techniques (in particular, how to efficiently handle different memory models) are described in [3,4,7]. Version 2.0.7, which participates in SV-COMP'21 [1], can be downloaded from https://github.com/hernanponcedeleon/Dat3M directly as a Java archive (.jar) or built from source code using the Maven build system. Dartagnan's verifier archive to reproduce the results of SV-COMP'21 is published at Zenodo under DOI 10.5281/zenodo.4483224.

<sup>-</sup>Jury member.

© The Author(s) 2021

J. F. Groote and K. G. Larsen (Eds.): TACAS 2021, LNCS 12652, pp. 428–432, 2021. https://doi.org/10.1007/978-3-030-72013-1\_26

```c
int main( void ) {
  unsigned int x = 1;
  unsigned int y = 0;
  while (y < 1024) {
    x = 0;
    y++;
  }
  __VERIFIER_assert(x == 0);
}
```
Fig. 1. Benchmark const\_1-1.c from the *ReachSafety-Loop* category.

Last year, Dartagnan participated only in the *ConcurrencySafety* category. New for SV-COMP'21 is that Dartagnan also participates in (part of) the *ReachSafety* category for single-threaded programs. Many tasks in that category contain loops with large bounds, which impacts Dartagnan's performance. To address the problem, we propose to leverage compiler optimizations.

#### 2 Leveraging Compiler Optimizations

BMC techniques are very sensitive to the program syntax. The loop structure and the number of variables directly impact the size of the SMT formula (which tends to relate to solving times). Our approach is to simplify the structure of the program (while preserving its semantics) before performing the verification. We do this by using compiler optimizations.

Consider the program in Fig. 1 from the *ReachSafety-Loop* category. A BMC tool has to unroll the program 1024 times to prove the program correct. However, since the value of x is constant at every loop iteration, the assignment can be moved outside the loop. Since the value of y is never read, the instruction y++ can be removed (using dead store elimination) leading to an empty loop which can also be removed. Finally, using constant propagation, the assertion can be re-written as \_\_VERIFIER\_assert(0 == 0) which is trivially true.

All these optimizations are implemented in most optimizing compilers. Since we perform the verification after compiling to LLVM, we get them for free. Due to the high number of loop iterations, Dartagnan needs more than 15 minutes to verify the program above. However, by using the -O3 optimization flag in the C-to-Boogie transformation, the verification task can be solved within seconds.
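The effect of the bound can be mimicked with a small Python sketch. It does not build an SMT formula as Dartagnan does; it only simulates the program of Fig. 1 under a given unrolling bound and compares it with a hand-optimized version. The function names are assumptions made for illustration. Hoisting the assignment is valid here because the loop body executes at least once.

```python
def run_original(bound):
    """Simulate the loop of Fig. 1 with at most `bound` unrollings.

    Returns (x, fully_unrolled); fully_unrolled is False when the bound
    is too small to reach the loop exit.
    """
    x, y = 1, 0
    iterations = 0
    while y < 1024:
        if iterations == bound:
            return x, False  # bound exhausted before the loop terminated
        x = 0
        y += 1
        iterations += 1
    return x, True

def run_optimized():
    """The program after hoisting x = 0, removing the dead store y++,
    and deleting the resulting empty loop: no unrolling is needed."""
    x = 0
    return x

x_final, complete = run_original(1024)
```

With a bound of 1023 the simulation stops before the loop exit, so the assertion cannot be established, whereas the optimized version reaches it without any unrolling.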

Using an optimizing compiler has its risks. Most optimizations are unsound for concurrent programs [9] and we do not use any for *ConcurrencySafety*. Even for sequential programs, there is a price to pay. Some optimizations introduce bitwise operations (e.g. multiplications tend to be compiled to shift operations) which were not present in the original program. We thus have to encode the semantics of such operations precisely.
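A small Python sketch, assuming a 32-bit unsigned int, illustrates why such operations call for bit-precise semantics. The helper names are hypothetical. The multiplication and the shift agree on machine integers, but an encoding over unbounded mathematical integers gives a different result once the value exceeds 32 bits.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit unsigned int

def mul8_u32(x):
    """x * 8 as computed on a 32-bit unsigned machine integer."""
    return (x * 8) & MASK32

def shl3_u32(x):
    """x << 3 on a 32-bit unsigned machine integer, as an optimizer may emit."""
    return (x << 3) & MASK32

samples = [0, 1, 7, 0x1FFFFFFF, 0x20000000, 0xFFFFFFFF]
agree_on_machine = all(mul8_u32(x) == shl3_u32(x) for x in samples)

# Over unbounded integers the result differs from the machine result
# as soon as the value overflows 32 bits.
x = 0x20000000
unbounded = x * 8      # no wrap-around
wrapped = mul8_u32(x)  # wraps around to a 32-bit value
```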

#### 3 The Price of Precision

To guarantee soundness when using the aforementioned compiler optimizations in the *ReachSafety* category, we use two precise encodings of integers. The first is a new implementation based on the theory of bit-vectors, where we get bit-precise reasoning for free. The second was our original implementation and it is based on the theory of integers. It does an *on-demand* conversion to bit-vectors and back (Int2Bv and Bv2Int). We are able to solve more benchmarks with the theory of bit-vectors than with the theory of integers plus conversion, which suggests that converting between the theories is expensive. For concurrent programs, the combination of bit-vectors with Dartagnan's memory-model-dependent encoding significantly degrades performance, and we use the theory of integers throughout the *ConcurrencySafety* category.

The trade-off between the efficiency of a theory and the precision in modeling semantics is well-known. In the context of symbolic execution, it was explored in [6]. SMACK implements an approach to diagnose spurious counterexamples caused by over-approximations and gradually refines the precision of reasoning about bitwise operations [5].

#### 4 Evaluation

We evaluated how compiler optimizations and different integer encodings affect Dartagnan's verification capabilities for some benchmarks in the *ReachSafety* category. We support two levels of optimization: -O0 (no optimization) and -O3 (enables most optimizations). For integer encodings we use two different approaches: theory of integers + bit conversions (QF\_LIA + QF\_BV logics) and pure theory of bit-vectors (QF\_BV logic).

The results are given in Fig. 2. We use BenchExec [2] for reliable benchmarking. The graph shows the verification time w.r.t. the verification score. Following the competition scheme, correct counter-examples and proofs give +1 and +2 points, respectively. Wrong counter-examples and proofs give -16 and -32 points. The absolute score values for incorrect results are higher because a single correct answer should not compensate for a wrong answer.
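The scoring scheme can be written down as a short Python function. The weights are those stated above; the function name and the example counts are hypothetical and are not results from the evaluation.

```python
def verification_score(correct_cex, correct_proofs, wrong_cex, wrong_proofs):
    """Score under the scheme above: +1 / +2 for correct counter-examples /
    proofs, -16 / -32 for wrong counter-examples / proofs."""
    return (1 * correct_cex + 2 * correct_proofs
            - 16 * wrong_cex - 32 * wrong_proofs)

# Hypothetical counts: 10 correct counter-examples, 20 correct proofs,
# and a single wrong counter-example.
example = verification_score(10, 20, 1, 0)
```

Under these weights, a single wrong proof cancels the points of sixteen correct proofs.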

It can be seen that, regardless of the chosen integer encoding, using compiler optimizations allows us to verify many more benchmarks, thus obtaining a higher score. The total number of solved tasks with no optimizations (O0+Bit-vectors and O0+Int-exact configurations from Fig. 2) is 89 with 77 correct and 12 incorrect results. When using optimizations (O3+Bit-vectors and O3+Int-exact configurations), we solved 336 tasks with 326 correct and 10 incorrect results.

Fig. 2. Comparing the performance of Dartagnan with different optimization flags and integer encodings.

The experiments show that combining theories to achieve precision is more expensive than using pure bit-vectors. The total number of solved tasks when using QF\_LIA + QF\_BV (configurations O0+Int-exact and O3+Int-exact) is 201 with 187 correct and 14 incorrect results. When using QF\_BV (configurations O0+Bit-vectors and O3+Bit-vectors) we solved 224 tasks with 216 correct and 8 incorrect results. All encodings are guaranteed to be sound; the incorrect results are due to bugs in the verifier.
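The two groupings reported in this section partition the same four configurations, once by optimization level and once by integer encoding, so their totals must coincide. The following Python sketch checks this using only the figures quoted in the text; the variable names are ours.

```python
# Figures reported in the text: (solved, correct, incorrect)
no_opt = (89, 77, 12)     # O0+Bit-vectors and O0+Int-exact
opt = (336, 326, 10)      # O3+Bit-vectors and O3+Int-exact
int_conv = (201, 187, 14) # O0+Int-exact and O3+Int-exact
pure_bv = (224, 216, 8)   # O0+Bit-vectors and O3+Bit-vectors

def consistent(row):
    """Solved tasks must equal correct plus incorrect results."""
    solved, correct, incorrect = row
    return solved == correct + incorrect

rows_ok = all(consistent(r) for r in (no_opt, opt, int_conv, pure_bv))

# Column-wise totals of both groupings: (solved, correct, incorrect)
by_optimization = tuple(a + b for a, b in zip(no_opt, opt))
by_encoding = tuple(a + b for a, b in zip(int_conv, pure_bv))
```

Both groupings add up to 425 solved tasks, of which 403 are correct and 22 incorrect.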

We used the evaluation described above to decide the configuration for SV-COMP'21. For category *ConcurrencySafety*, we use the integer encoding and no compiler optimizations. For categories *ReachSafety-Loop*, *ReachSafety-BitVectors* and *ReachSafety-Arrays*, Dartagnan uses the theory of bit-vectors and -O3 optimizations. These configurations are internally decided by the tool based on the use of the pthreads library. Compared with SV-COMP'20, we solved 60 more tasks in *ConcurrencySafety* (55% increase) and 474 more tasks overall (582% increase).

*Acknowledgement:* We thank the SMACK developers for their constant support with the C-to-Boogie transformation. We also thank Yun Zhang for her contributions to the development of the witness generation.

#### References


encodings. In *CAV*, volume 11561 of *LNCS*, pages 355–365. Springer, 2019. doi: 10.1007/978-3-030-25540-4\_19.


### **Gazer-Theta: LLVM-based Verifier Portfolio with BMC/CEGAR (Competition Contribution)**

Zsófia Ádám<sup>1</sup>, Gyula Sallai<sup>2</sup>, and Ákos Hajdu<sup>1</sup>(✉)

<sup>1</sup> Budapest University of Technology and Economics, Budapest, Hungary hajdua@mit.bme.hu <sup>2</sup> SonarSource S.A., Geneva, Switzerland

**Abstract.** Gazer-Theta is a software model checking toolchain including various analyses for state reachability. The frontend, namely Gazer, supports C programs through an LLVM-based transformation and optimization pipeline. Gazer includes an integrated bounded model checker (BMC) and can also employ the Theta backend, a generic verification framework based on abstraction-refinement (CEGAR). On SV-COMP 2021, a portfolio of BMC, explicit-value analysis, and predicate abstraction is applied sequentially in this order.

#### **1 Verification Approach and Software Architecture**

Gazer-Theta is a software model checking toolchain with two main components: Gazer, an LLVM-based frontend and Theta, a generic model checking framework. An overview of the architecture and the verification approach can be seen in Figure 1.

**Fig. 1.** Overview of the architecture. Solid arrows represent the workflow, dashed arrows indicate dependency. Gazer and Theta components are denoted by lighter and darker backgrounds, respectively.

<sup>-</sup>Jury member representing Gazer-Theta at SV-COMP 2021.

© The Author(s) 2021

J. F. Groote and K. G. Larsen (Eds.): TACAS 2021, LNCS 12652, pp. 433–437, 2021. https://doi.org/10.1007/978-3-030-72013-1\_27

Gazer. Gazer [7] is a verification frontend for C programs written in C++17, using the LLVM compiler infrastructure.<sup>3</sup> The input is a C program (possibly consisting of multiple source files) that is first translated to the LLVM IR (intermediate representation) using the clang compiler. Next, various built-in and custom LLVM passes are executed to perform optimizations (e.g., inlining, constant propagation, assertion lifting) and transformations (e.g., adding traceability information) on the IR. The LLVM IR is then transformed into different variants of control flow automata (CFA), depending on the backend to be used. Gazer includes a built-in variant [5,7] of bounded model checking [2], relying on the z3 SMT solver [6]. The other supported backend is Theta (to be presented below). Currently, both backends provide analysis for reachability properties.

In the final step, the "raw" results of the backends are processed to produce a verdict (safe, unsafe, unknown) and a witness. Currently, Gazer only supports violation witnesses, both in a user-friendly syntax and in the format of SV-COMP. Furthermore, Gazer is also capable of generating executable test harnesses that can be used, e.g., in a debugger to reach the property violation.

Theta. Theta [8] is a generic and modular model checking framework written in Java 11, providing abstraction- and CEGAR-based analyses [4] for various formalisms, including CFA. Theta is highly configurable, supporting different abstract domains (such as explicit-value analysis [1] or predicate abstraction [3]) and refinement strategies, mostly based on interpolation (using SMT solvers such as z3 [6]). In the explicit-value analysis, only a subset of program variables is tracked, while predicate abstraction keeps track of logical facts and relationships instead of concrete values.

Verification portfolio. Based on our preliminary experiments, at SV-COMP 2021, we apply a sequential portfolio consisting of 3 steps, as illustrated by Figure 2. The portfolio is implemented as a Python script, which calls the tools described previously. First, bounded model checking is performed with a 150s time limit, which – in our experience – can already solve many unsafe instances. If BMC is inconclusive, we move on to an explicit-value analysis with a 100s limit, which can be effective for simpler, mostly deterministic programs. Finally, if the result is still unknown, we move on to the more heavyweight method of predicate abstraction. If any of the phases reports an unsafe result, as an additional step, we generate an executable test harness from the counterexample and check if the program actually reaches the property violation. This allows us to filter out some false positives (by reporting unknown instead of unsafe).
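The control flow of the portfolio can be sketched in Python as follows. This is not the actual portfolio script: the function name, the callables passed in, and the verdict strings are assumptions made for illustration. Only the phase order, the 150s and 100s limits, and the harness check are taken from the description above; no dedicated limit is assumed for the last phase.

```python
def run_portfolio(task, bmc, explicit_value, predicate_abstraction, replay_harness):
    """Run the three analyses in order and return 'safe', 'unsafe' or 'unknown'.

    Each analysis is a hypothetical callable (task, time_limit) -> verdict.
    replay_harness(task) is a hypothetical callable that returns True if the
    generated test harness actually reaches the property violation.
    """
    phases = [
        (bmc, 150),                     # bounded model checking, 150s
        (explicit_value, 100),          # explicit-value analysis, 100s
        (predicate_abstraction, None),  # no dedicated limit assumed here
    ]
    for analysis, limit in phases:
        verdict = analysis(task, limit)
        if verdict == "safe":
            return "safe"
        if verdict == "unsafe":
            # Filter out false positives: report unknown if the violation
            # cannot be reproduced by executing the harness.
            return "unsafe" if replay_harness(task) else "unknown"
    return "unknown"
```

An inconclusive phase simply falls through to the next one, so the heavier analyses run only when the cheaper ones could not decide the task.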

#### **2 Strengths and Weaknesses**

**Fig. 2.** Overview of the portfolio approach. Symbols indicate safe, inconclusive and unsafe results. Numbers indicate the time limit of each phase.

Gazer-Theta currently targets reachability analysis, so we participate in the ReachSafety category, excluding the subcategories Arrays, Heap and Sequentialized due to features with limited support (e.g., pointers). The strength of the tool is its modularity and configurability, combining the advantages of different analyses into a diverse portfolio. Out of the 3679 tasks, there are 1722 confirmed correct (1079 safe, 643 unsafe), 4 unconfirmed correct, and 13 incorrect (false positive) results. A majority of the solved tasks (86% of 1722) come from the BMC phase; with a few exceptions, the CEGAR analyses need to be utilized only for safe instances (though, based on our experiments, they could also handle most of the tasks solved by BMC). The explicit-value analysis handles a further 100 tasks in the ECA subcategory, while predicate abstraction solves 130 additional instances from Loops and ProductLines. Surprisingly, BMC can also solve a significant number (857) of safe instances, which can be attributed to LLVM optimizations and enhancements in the algorithm [7]. Furthermore, we observed that executable harnesses could rule out many (142) false positives.

<sup>3</sup> https://llvm.org/

The weakness of Gazer-Theta is its limited support for certain features, such as arrays, bit-precise reasoning (only available for BMC), and pointers. We also observed that the LLVM IR representation often results in large CFA (e.g., many temporary variables due to SSA form), which makes reasoning harder via CEGAR (as witnessed, e.g., by the ECA subcategory). Currently, the tool gives empty correctness witnesses only meeting syntactical requirements, but surprisingly most of them were accepted. Furthermore, our violation witnesses are quite "sparse" due to heavy usage of optimization passes, but some validators can still prove their correctness. The 13 false positive results are caused by unsupported library functions (related to floats) treated as external calls with undefined (arbitrary) behavior.

#### **3 Tool Setup and Configuration**

The competition contribution is based on Gazer v1.2.1<sup>4</sup> and Theta v2.5.0.<sup>5</sup> Additionally, the BMC backend of Gazer uses z3 version 4.8.6, while Theta is based on z3 version 4.5.0. The projects' repositories contain instructions on building the tools, but an archive can be found on Zenodo<sup>6</sup> with pre-built binaries for Ubuntu 18.04 or 20.04. The toolchain requires the packages clang-9, libgomp1, llvm-9, openjdk-11-jre-headless and python3 to be installed. The entry point of the toolchain is scripts/gazer\_starter.py, which takes the verification task (C program) as its only mandatory input and runs the portfolio. No other parameters or configuration are required. Optionally, the output directory can be set (--output) and the version can be queried (--version).

<sup>4</sup> https://github.com/ftsrg/gazer/releases/tag/v1.2.1

<sup>5</sup> https://github.com/ftsrg/theta/releases/tag/v2.5.0

<sup>6</sup> http://doi.org/10.5281/zenodo.4483627

#### **4 Software Project**

Gazer and Theta are maintained by the Critical Systems Research Group<sup>7</sup> of the Budapest University of Technology and Economics with various contributors. The projects are available open-source on GitHub<sup>8</sup> under an Apache 2.0 license.

Acknowledgment. The authors would like to thank Tamás Tóth, László Radnai, Mihály Dobos-Kovács, István Majzik, Zoltán Micskei, András Vörös and Vince Molnár for their contributions to the projects; and the competition organizers, especially Dirk Beyer, for their help during the preparation for SV-COMP.

This research has received funding from the EU ECSEL JU under the H2020 Framework Programme, JU grant nr. 826452 (Arrowhead Tools project) and from the partners' national funding authorities.

#### **References**


<sup>7</sup> https://ftsrg.mit.bme.hu

<sup>8</sup> https://github.com/ftsrg/gazer and https://github.com/ftsrg/theta

### Goblint: Thread-Modular Abstract Interpretation Using Side-Effecting Constraints (Competition Contribution)

Simmo Saan<sup>1</sup>(✉), Michael Schwarz<sup>2</sup>(✉), Kalmer Apinis<sup>1</sup>, Julian Erhard<sup>2</sup>, Helmut Seidl<sup>2</sup>, Ralf Vogler<sup>2</sup>, and Vesal Vojdani<sup>1</sup>

<sup>1</sup> University of Tartu, Tartu, Estonia {simmo.saan, kalmer.apinis, vesal.vojdani}@ut.ee <sup>2</sup> Technische Universität München, Garching, Germany {m.schwarz, julian.erhard, helmut.seidl, ralf.vogler}@tum.de

Abstract. Goblint is a static analysis framework for C programs specializing in data race analysis. It relies on thread-modular abstract interpretation where thread interferences are accounted for by means of flow-insensitive global invariants.

#### 1 Verification Approach

Goblint is a static analyzer for C programs based on the framework of abstract interpretation [5]. It performs flow- and context-sensitive interprocedural analysis, using partial tabulation to handle procedure calls. The analysis of concurrent programs is thread-modular: analyzing each thread in isolation, as opposed to analyzing their interleavings. This scales well to larger programs with many threads. Interferences between threads happen through global variables, which are abstracted by a context- and flow-insensitive global invariant. When no other thread can interfere, copies of global variables are *privatized* within the local state. Their values may deviate from the global invariant due to local updates, thereby improving precision [11].

The analysis is specified using a side-effecting constraint system [3], in which right-hand sides of constraints can, during their evaluation, make additional contributions (*side effects*) to other constraint system variables. These side effects can be conveniently used both to express partial context-sensitivity of function calls and to add contributions to the global invariant. Such a constraint system is solved using a *local* generic solver, which yields a (post-)solution for just the reachable program points and contexts [1,8]. Solving is not strictly separated into widening and narrowing phases, but these may be intertwined instead [1]. Results of the analysis are reported only at the end based on the computed solution, as widening during the fixpoint computation might lead to spurious property violations, which later disappear due to narrowing.
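The idea of side effects can be illustrated with a toy Python sketch. It is not Goblint's solver: it uses naive round-robin iteration over all unknowns instead of a local solver, sets of concrete values joined by union instead of Goblint's abstract domains, and no widening or narrowing. All names are hypothetical. Each right-hand side receives a `get` function to read unknowns and a `side` function to contribute to other unknowns, here the flow-insensitive invariant of a global `g`.

```python
from collections import defaultdict

def solve(constraints):
    """Round-robin fixpoint iteration; values are frozensets joined by union."""
    sigma = defaultdict(frozenset)
    changed = True

    def contribute(unknown, value):
        nonlocal changed
        joined = sigma[unknown] | value
        if joined != sigma[unknown]:
            sigma[unknown] = joined
            changed = True

    while changed:
        changed = False
        for unknown, rhs in constraints.items():
            # A right-hand side may read unknowns and emit side effects.
            contribute(unknown, rhs(lambda y: sigma[y], contribute))
    return dict(sigma)

def main_start(get, side):
    side("g", frozenset({0}))  # initial value of the global g
    return frozenset()

def writer_after_store(get, side):
    side("g", frozenset({1}))  # a thread executes g = 1
    return frozenset()

def reader_after_load(get, side):
    return get("g")            # x = g: x may hold any value of the invariant

solution = solve({
    "reader.x": reader_after_load,  # listed first to force re-evaluation
    "main": main_start,
    "writer": writer_after_store,
})
```

The reader is evaluated before any contribution to `g` exists, so its value is only correct after a further round; a solver therefore has to re-evaluate the unknowns that depend on an unknown changed by a side effect.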

<sup>-</sup>Jury member

© The Author(s) 2021

J. F. Groote and K. G. Larsen (Eds.): TACAS 2021, LNCS 12652, pp. 438–442, 2021. https://doi.org/10.1007/978-3-030-72013-1\_28

*Reachability Safety.* Reachability is mainly determined using value analysis, which, for integers, employs abstract domains based on intervals and exclusion sets. The value analysis also handles pointers (computing points-to information), heap memory (using allocation-site abstraction), structs, unions and arrays. The abstraction of arrays employs partitioning by the symbolic expression that is used to index into the array. On top of that, both global variables and heap-allocated memory are partitioned into disjoint regions [9].

*No Overflows.* The sound interval analysis is implemented using arbitrary precision integers. If the interval for an expression lies completely in the value range of its signed integer type, no overflow can occur at this location.
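The overflow check described above can be illustrated with a minimal sketch. It is not Goblint's OCaml implementation; the `Interval` class and the helper names are hypothetical, and Python's built-in unbounded integers stand in for the arbitrary-precision integers mentioned in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed integer interval [lo, hi]; Python ints never wrap around."""
    lo: int
    hi: int

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other):
        # The extreme values of a product lie at the corner combinations.
        corners = [a * b for a in (self.lo, self.hi) for b in (other.lo, other.hi)]
        return Interval(min(corners), max(corners))


def signed_range(bits):
    """Value range of a two's-complement signed integer type (hypothetical helper)."""
    return Interval(-(1 << (bits - 1)), (1 << (bits - 1)) - 1)


def no_overflow(result, bits=32):
    """True if the interval lies completely inside the signed value range."""
    r = signed_range(bits)
    return r.lo <= result.lo and result.hi <= r.hi
```

For instance, with `x` and `y` both in `[0, 1000]`, the interval of `x * y` fits into 32 bits, whereas adding 1 to a value that may be `INT_MAX` does not.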

*No Data Race.* The main goal of Goblint is data race detection and its analyses have been optimized for this purpose. Mutexes may be handled both path-sensitively and symbolically. Memory accesses are partitioned (e.g., by heap region [9]), while locking expressions and access expressions are correlated using address equalities (e.g., a domain of affine and Herbrand equalities [10]) in order to analyze more sophisticated locking patterns [11].

#### 2 Software Architecture

Goblint is implemented in OCaml and uses an updated fork of CIL [6] as its parser frontend for the C language. Since the latter requires preprocessed code, GCC is executed for preprocessing the input, although this step should be unnecessary on the SV-COMP benchmarks. No other major libraries or external tools are required.

The architecture of Goblint [2] is designed to be modular. Analyses, which are defined by their abstract domains and transfer functions, can be activated via runtime configuration options. A flexible query system allows for communication between analyses. Together, the combined analyses and the control-flow graphs of the functions in the program provide the side-effecting constraint system, which is solved by some local generic solver. While a number of solvers are available, the improved top-down solver TD3 [8] was employed for SV-COMP 2021. Post-processing the solution yields results for the analysis.

#### 3 Strengths and Weaknesses

Due to over-approximation, abstract interpretation as employed by Goblint can only determine whether the correctness specification *must hold* or *may be violated*, but not whether a concrete violating execution exists. Therefore, to avoid a large number of false alarms due to imprecision in SV-COMP, Goblint only reports results "true" and "unknown" respectively. This is a clear limitation of our approach, as all competing tools do report definite violations. The strength of our approach, on the other hand, is that it aims to be sound by design (up to out-of-scope features of the input program as, e.g., inline assembler). This is evidenced by the fact that Goblint does not produce any incorrect results in the competition.

Goblint performs best in the *SoftwareSystems* and *ReachSafety-Product-Lines* categories that consist of larger real-world programs, for which our approach is well suited. On the downside, our verifier performs poorly in reachability safety categories that contain smaller programs with intricate correctness conditions which our abstract domains cannot express.

Even though the support for checking overflows is very new in Goblint, it has some success in the *NoOverflows* category. Unfortunately, the tool has no success in *SoftwareSystems-\*-NoOverflows*.

Although Goblint specializes in concurrency, it performs quite poorly in the *ConcurrencySafety* category. We believe this is because most benchmarks in the category require rather precise analysis of thread interleavings, which is not done in our thread-modular approach.

As Goblint has been optimized for data race detection, it unsurprisingly performs better in the *NoDataRace* demo category. It must be noted that the majority of benchmarks in the category were submitted from our own test suite, consisting of racy and race-free programs.

While the analyses can be fine-tuned via configuration options, the parameters are static and currently depend on neither the property nor the input program. A more granular and dynamic configuration system would allow increased precision, by enabling more expensive analyses where necessary, or decreased resource usage, by disabling unnecessary analyses, e.g., concurrency analyses on single-threaded programs. Furthermore, integrating counterexample-guided abstraction refinement (CEGAR) into our framework might allow Goblint to also report violations, while avoiding false alarms and gaining more precision.

#### 4 Tool Setup and Configuration

Goblint version svcomp21-0-g82e03b87 participated in SV-COMP 2021 [4,7]. It is available in both binary (Ubuntu 20.04) and source code form at our GitHub repository under the svcomp21 tag.<sup>3</sup> The only runtime dependency is GCC. Instructions for building from source can be found in the README.

Both the tool-info module and the benchmark definition for SV-COMP are named goblint. They correspond to running the tool as follows:

./goblint --conf conf/svcomp21.json --sets ana.specification property.prp input.c

Goblint participated in the following categories: *ReachSafety*, *ConcurrencySafety*, *NoOverflows*, *SoftwareSystems* (while opting out of *SoftwareSystems-\*-MemSafety*) and *NoDataRace* (demo category).

<sup>3</sup> https://github.com/goblint/analyzer/releases/tag/svcomp21

#### 5 Software Project and Contributors

Goblint development takes place on GitHub,<sup>4</sup> while related publications are listed on its website.<sup>5</sup> It is an MIT-licensed joint project of the Technische Universität München (Chair of Formal Languages, Compiler Construction, Software Construction) and University of Tartu (Laboratory for Software Science).

*Acknowledgements.* This work was supported by Deutsche Forschungsgemeinschaft (DFG) – 378803395/2428 ConVeY and the Estonian Research Council grant PSG61. We would like to thank everyone who has contributed to Goblint over the years.

#### References


<sup>4</sup> https://github.com/goblint/analyzer

<sup>5</sup> https://goblint.in.tum.de

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

### **Towards String Support in JayHorn (Competition Contribution)**

Ali Shamakhi<sup>1</sup> (✉), Hossein Hojjat<sup>1,2</sup>, and Philipp Rümmer<sup>3</sup>

> <sup>1</sup> University of Tehran, Tehran, Iran {ali.shamakhi,hojjat}@ut.ac.ir <sup>2</sup> Tehran Institute for Advanced Studies, Tehran, Iran <sup>3</sup> Uppsala University, Uppsala, Sweden philipp.ruemmer@it.uu.se

**Abstract.** JayHorn is a Horn clause-based model checker for Java programs that has been competing at SV-COMP since 2019. An ongoing research and implementation effort is to add support for the String data type to JayHorn. Since current Horn solvers do not support strings natively, we consider a representation of (unbounded) strings using algebraic data types, more precisely as lists. This paper discusses Horn clause encodings of different string operations, and presents preliminary results.

#### **1 The JayHorn Approach and Architecture**

We start by summarising the approach used in JayHorn, and refer to earlier papers [5,6,7] for more details. JayHorn is a verification tool that encodes sequential Java programs as sets of Constrained Horn Clauses (CHCs) in order to check for possible assertion violations. The main CHC encoding in JayHorn is inspired by refinement types [2] and liquid types [8], and characterises programs in terms of method contracts, state invariants, and instance invariants of classes [5]. This encoding is over-approximate, and can prove absence of assertion violations. In order to find counterexamples, i.e., prove existence of violations, JayHorn also offers a bounded, under-approximate program encoding.

JayHorn is entirely implemented in Java, and uses the Soot framework [10] to process Java bytecode, and the CHC solver Eldarica [3] to solve Horn clauses.

#### **2 Encoding of String Operations**

In this paper, we focus on the handling of Strings and their operations, a feature of Java that was not previously supported by JayHorn. Since JayHorn verifies programs without imposing bounds on the number of execution steps or the size of input data, our goal is to handle also unbounded strings. Unfortunately, while there has been significant progress in SMT solving for strings, current CHC solvers do not yet support strings natively. We therefore use recursive algebraic data types to model strings, and follow the approach proposed in [4]: strings are represented using lists, with a binary constructor cons and the constant nil.

© The Author(s) 2021

J. F. Groote and K. G. Larsen (Eds.): TACAS 2021, LNCS 12652, pp. 443–447, 2021. https://doi.org/10.1007/978-3-030-72013-1\_29

There are two ways to encode a string using cons and nil. The Left-To-Right (LTR) encoding starts with the leftmost character of the string. For example, "Jay" = cons('J', cons('a', cons('y', nil))). The Right-to-Left (RTL) encoding starts with the rightmost character. Each encoding has its own benefits and drawbacks in modeling various operations, an aspect we evaluate in this paper.
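The two encodings can be made concrete with a small sketch. It models `cons` cells as Python tuples and `nil` as `None`; the helper names `encode_ltr` and `encode_rtl` are hypothetical and not part of JayHorn.

```python
NIL = None  # the constant nil


def cons(head, tail):
    """Binary list constructor: one character plus the rest of the list."""
    return (head, tail)


def encode_ltr(s):
    """Left-to-Right: the outermost cons holds the leftmost character."""
    lst = NIL
    for ch in reversed(s):
        lst = cons(ch, lst)
    return lst


def encode_rtl(s):
    """Right-to-Left: the outermost cons holds the rightmost character."""
    lst = NIL
    for ch in s:
        lst = cons(ch, lst)
    return lst
```

Under RTL, appending a character to a string is a single `cons`, while under LTR the same holds for prepending. This asymmetry is the reason why the two encodings suit different operations.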

Three different LTR encodings of the concatenation operation are described in [4], and equivalent RTL encodings are easy to define. Moving beyond concatenation, in this paper we show models of some of the more involved operations.

#### **2.1 The CompareTo Operation**

The String.compareTo method in Java returns an integer, which is the difference of the lengths of the strings if one of the strings is a prefix of the other (e.g., "cat".compareTo("c") == 2), or the difference of their leftmost same-index different characters otherwise (e.g., "card".compareTo("cash") == -1, since their leftmost same-index different characters are 'r' and 's', respectively).

The method is modeled using the predicate $P_{rec}$(left, right, comparison result) under LTR encoding, which allows us to recursively remove leftmost characters from both strings to reach a state in which the comparison result is known.

$$\begin{array}{rcl}
P_{rec}(x,\mathsf{nil},\mathsf{len}(x)) & \leftarrow & true\\
P_{rec}(\mathsf{nil},y,-\mathsf{len}(y)) & \leftarrow & true\\
P_{rec}(x,x,0) & \leftarrow & true\\
P_{rec}(\mathsf{cons}(j,x),\mathsf{cons}(k,y),j-k) & \leftarrow & j\neq k\\
P_{rec}(\mathsf{cons}(h,x),\mathsf{cons}(h,y),d) & \leftarrow & P_{rec}(x,y,d)
\end{array}$$
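Read as a recursive function, the LTR clauses can be sketched as follows. This is only an illustration on Python strings, where `s[0]` plays the role of the head of a `cons` cell and `s[1:]` the tail; the function name is hypothetical. The clause for two identical strings needs no separate case here, because equal strings reach the first case with both arguments empty.

```python
def compare_to_ltr(left, right):
    """Functional reading of the LTR clauses for String.compareTo."""
    if right == "":
        return len(left)        # P_rec(x, nil, len(x))
    if left == "":
        return -len(right)      # P_rec(nil, y, -len(y))
    j, k = left[0], right[0]
    if j != k:
        return ord(j) - ord(k)  # leftmost differing characters decide
    # Equal heads: drop them and compare the tails.
    return compare_to_ltr(left[1:], right[1:])
```

The two examples from the text evaluate to `compare_to_ltr("cat", "c") == 2` and `compare_to_ltr("card", "cash") == -1`.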

The predicate under RTL encoding needs an extra argument to keep track of whether the comparison result is based on a character difference or not, so the predicate is $P'_{\rm rec}$(left, right, comparison result, char diff). The clauses use the len function to compute the length of a string, which is a built-in function in Eldarica.

$$\begin{array}{rcl}
P'_{\rm rec}(x,\mathsf{nil},\mathsf{len}(x),false) & \leftarrow & true\\
P'_{\rm rec}(\mathsf{nil},y,-\mathsf{len}(y),false) & \leftarrow & true\\
P'_{\rm rec}(x,x,0,false) & \leftarrow & true\\
P'_{\rm rec}(\mathsf{cons}(h,x),y,d+1,false) & \leftarrow & P'_{\rm rec}(x,y,d,false)\wedge\mathsf{len}(x)\geq \mathsf{len}(y)\\
P'_{\rm rec}(x,\mathsf{cons}(h,y),d-1,false) & \leftarrow & P'_{\rm rec}(x,y,d,false)\wedge\mathsf{len}(x)\leq \mathsf{len}(y)\\
P'_{\rm rec}(\mathsf{cons}(j,x),\mathsf{cons}(k,x),j-k,true) & \leftarrow & j\neq k\\
P'_{\rm rec}(\mathsf{cons}(h,x),y,d,true) & \leftarrow & P'_{\rm rec}(x,y,d,true)\\
P'_{\rm rec}(x,\mathsf{cons}(h,y),d,true) & \leftarrow & P'_{\rm rec}(x,y,d,true)
\end{array}$$
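One possible functional reading of the RTL clauses is sketched below. It is an interpretation for illustration, not code from JayHorn: Python strings stand in for RTL lists, so `s[-1]` is the head of the `cons` cell and `s[:-1]` its tail, and the function returns the pair (comparison result, char diff).

```python
def compare_to_rtl(left, right):
    """Returns (d, char_diff) by peeling off rightmost characters."""
    if right == "":
        return (len(left), False)
    if left == "":
        return (-len(right), False)
    if len(left) > len(right):
        d, diff = compare_to_rtl(left[:-1], right)
        # A character difference is unaffected by appended characters;
        # otherwise the length difference grows by one.
        return (d, True) if diff else (d + 1, False)
    if len(left) < len(right):
        d, diff = compare_to_rtl(left, right[:-1])
        return (d, True) if diff else (d - 1, False)
    # Equal lengths: compare the prefixes first.
    d, diff = compare_to_rtl(left[:-1], right[:-1])
    if diff:
        return (d, True)
    # The prefixes are identical, so the rightmost characters decide.
    j, k = left[-1], right[-1]
    return (ord(j) - ord(k), True) if j != k else (0, False)
```

The flag is what allows the recursion to distinguish a length-based result, which changes when a character is appended, from a character-based result, which does not.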

#### **2.2 Integer to String conversion**

The integer to string conversion relies on extracting digits one by one, which is done using integer arithmetic. Under LTR encoding, during the conversion process, the pre-condition stores the rest of the input after removing the digits converted so far, starting from the lowest position. For example, if the number is $i = d_{n-1} \cdots d_0$ and the converted string so far is $s = \text{"}d_{k-1} \cdots d_0\text{"}$, the rest of the number will be $r = d_{n-1} \cdots d_k$, which is stored in the pre-condition.

The pre-condition in RTL encoding stores the offset of the next digit that needs to be extracted, since extracting digits from highest place values requires knowing their positions.
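The difference between the two directions can be sketched with ordinary loops. The sketch below assumes non-negative integers and uses hypothetical function names; `rest` and `offset` correspond to the values that the text says are stored in the pre-condition.

```python
def int_to_string_ltr(i):
    """LTR: extract the lowest digit first and prepend it (one cons)."""
    if i == 0:
        return "0"
    rest, s = i, ""
    while rest > 0:
        rest, digit = divmod(rest, 10)   # rest = d_{n-1} ... d_k
        s = chr(ord("0") + digit) + s    # s = "d_{k-1} ... d_0"
    return s


def int_to_string_rtl(i):
    """RTL: extract the highest digit first and append it (one cons)."""
    if i == 0:
        return "0"
    offset = 1
    while offset * 10 <= i:              # place value of the leading digit
        offset *= 10
    s = ""
    while offset > 0:
        digit = (i // offset) % 10
        s = s + chr(ord("0") + digit)
        offset //= 10
    return s
```

The LTR loop needs no knowledge of the number of digits, whereas the RTL loop first has to determine the position of the leading digit.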

#### **2.3 StartsWith and EndsWith**

The encoding of String.startsWith method needs to consider different states of both strings and their relation, which leads to multiple recursive relations.

For example, if x starts with y, we can prepend c to both strings under LTR encoding (to get x′ and y′) and the condition holds on the resulting strings (i.e., x′ starts with y′). As another example, if x does not start with y and len(x) ≥ len(y), we can append c to x under RTL encoding (to get x′) and the condition holds on the resulting string (i.e., x′ does not start with y).

$$\begin{array}{lrcl}
 & S_{\rm rec}(x, \mathsf{nil}, true) & \leftarrow & true\\
 & S_{\rm rec}(x, x, true) & \leftarrow & true\\
 & S_{\rm rec}(\mathsf{nil}, y, false) & \leftarrow & \mathsf{len}(y) > 0\\
 & S_{\rm rec}(\mathsf{cons}(j, x), \mathsf{cons}(k, y), false) & \leftarrow & S_{\rm rec}(x, y, false)\\
\text{(LTR)} & S_{\rm rec}(\mathsf{cons}(h, x), \mathsf{cons}(h, y), true) & \leftarrow & S_{\rm rec}(x, y, true)\\
\text{(LTR)} & S_{\rm rec}(\mathsf{cons}(j, x), \mathsf{cons}(k, y), false) & \leftarrow & j \neq k\\
\text{(RTL)} & S_{\rm rec}(\mathsf{cons}(h, x), y, true) & \leftarrow & S_{\rm rec}(x, y, true)\\
\text{(RTL)} & S_{\rm rec}(\mathsf{cons}(j, x), \mathsf{cons}(k, x), false) & \leftarrow & j \neq k\\
\text{(RTL)} & S_{\rm rec}(\mathsf{cons}(h, x), y, false) & \leftarrow & S_{\rm rec}(x, y, false) \land \mathsf{len}(x) \ge \mathsf{len}(y)\\
\text{(RTL)} & S_{\rm rec}(x, \mathsf{cons}(h, y), false) & \leftarrow & S_{\rm rec}(x, y, false)
\end{array}$$

The RTL encoding of endsWith is the same as LTR encoding of startsWith, and the LTR encoding of endsWith is the same as RTL encoding of startsWith.
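A functional reading of the LTR clauses for startsWith, and of the duality with endsWith, might look as follows. The sketch works on Python strings with `s[0]` as the head of the list, and models the RTL view of a string simply as its reversal; both function names are hypothetical.

```python
def starts_with_ltr(x, y):
    """Functional reading of the common and LTR clauses for startsWith."""
    if y == "":
        return True            # S_rec(x, nil, true)
    if x == "":
        return False           # S_rec(nil, y, false) with len(y) > 0
    if x[0] != y[0]:
        return False           # (LTR) differing heads
    # Equal heads: the result is inherited from the tails.
    return starts_with_ltr(x[1:], y[1:])


def ends_with_rtl(x, y):
    """endsWith under RTL is startsWith under LTR on the reversed view."""
    return starts_with_ltr(x[::-1], y[::-1])
```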

#### **2.4 CharAt**

The encoding definition of String.charAt relies on the fact that prepending a character to a string under LTR encoding increases indices of all previous characters by one, while appending a character to a string under RTL encoding does not change those indices.

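$$\begin{array}{lrcl}
\text{(LTR)} & ChAt_{\rm rec}(\mathsf{cons}(h, t), 0, h) & \leftarrow & true\\
\text{(LTR)} & ChAt_{\rm rec}(\mathsf{cons}(h, t), i + 1, c) & \leftarrow & ChAt_{\rm rec}(t, i, c) \land 0 \le i < \mathsf{len}(t)\\
\text{(RTL)} & ChAt_{\rm rec}(\mathsf{cons}(h, t), \mathsf{len}(t), h) & \leftarrow & true\\
\text{(RTL)} & ChAt_{\rm rec}(\mathsf{cons}(h, t), i, c) & \leftarrow & ChAt_{\rm rec}(t, i, c) \land 0 \le i < \mathsf{len}(t)
\end{array}$$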
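The index shift under LTR and the stable indices under RTL can be seen in the following sketch. It is an illustration on Python strings with hypothetical function names: for LTR the head is `s[0]`, for RTL the head is `s[-1]`.

```python
def char_at_ltr(s, i):
    """LTR: the head has index 0; indices in the tail are shifted by one."""
    if s == "":
        raise IndexError("index out of range")
    if i == 0:
        return s[0]
    return char_at_ltr(s[1:], i - 1)


def char_at_rtl(s, i):
    """RTL: the head has index len(tail); tail indices are unchanged."""
    if s == "":
        raise IndexError("index out of range")
    if i == len(s) - 1:
        return s[-1]
    return char_at_rtl(s[:-1], i)
```

Out-of-range indices, including negative ones, eventually reach the empty string in both variants and raise an error, which mirrors the side condition 0 ≤ i < len(t) in the recursive clauses.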

#### **3 Performance of the String Encoding**

The following table shows the results of JayHorn on the 53 problems in the SV-COMP Java track that involve strings. Many of the programs contain string operations that are not yet handled in JayHorn, but the results already make it possible to compare encoding choices. Uniformly, RTL performs better than LTR (probably because appending characters to strings is more common than adding characters in the beginning), and the under-approximating CHC encoding of JayHorn performs better than the over-approximate encoding (probably because over-approximation too often loses information about string contents). The choice between Iterative, Recursive, or Recursive-with-precondition [4] for string concatenation surprisingly had no effect on the results.


In other respects, JayHorn performed similarly in SV-COMP 2021 [1] to the two previous years. JayHorn gave one incorrect answer, for the problem UnsatAddition02, due to the use of unbounded integer arithmetic instead of the correct Java machine arithmetic semantics. JayHorn could correctly prove 125 benchmarks safe and 151 benchmarks unsafe. Changes compared to 2020 include 59 of the 64 MinePump benchmarks (by encoding enums, see Section 4) and 6 of the 53 string benchmarks that JayHorn solves now.

The biggest factor influencing the performance of JayHorn in SV-COMP is still the incomplete model of the Java API in JayHorn, given the large number of API tests among the SV-COMP Java benchmarks. Our work on supporting Strings, described in this paper, is one of the efforts to address the situation.

#### **4 Tool Setup**

The version submitted to SV-COMP 2021 is JayHorn version 0.7.5-strings,<sup>4</sup> which is also available on Zenodo [9]. In the configuration used in the competition,<sup>5</sup> JayHorn only applies the Horn solver Eldarica. The Benchexec tool info module is called jayhorn.py and the benchmark definition file jayhorn.xml. JayHorn competes in the Java category.

Since JayHorn only has incomplete support for Java enums, this year we added a small source transformation tool<sup>6</sup> to JayHorn that replaces enums with simple integer variables. The script used in the competition applies the transformation tool to the benchmark source code prior to compilation to bytecode.

<sup>4</sup> https://github.com/jayhorn/jayhorn/releases/tag/v0.7.5-strings

<sup>5</sup> Java options: -Xss40000k -Xmx12g JayHorn options: -inline-size 50 -conservative -specs -string-encoding recursiveWithPrec -string-direction rtl

<sup>6</sup> https://github.com/jayhorn/jayhorn/tree/devel/enum-eliminator

#### **5 Software Project and Contributors**

JayHorn was initially developed by Temesghen Kahsai, Philipp Rümmer, and Martin Schäf, with contributions by Daniel Dietsch, Rody Kersten, Huascar Sanchez, and Valentin Wüstholz [6,7]. Further development of the tool is at the moment mainly carried out by the authors of this paper. JayHorn is open source, and distributed under MIT license on https://github.com/jayhorn/jayhorn.

Acknowledgements. The work on JayHorn has been supported by the Swedish Research Council (VR) under grant 2018-04727, by the Swedish Foundation for Strategic Research (SSF) under the project WebSec (Ref. RIT17-0011), and by grants from Microsoft and Amazon Web Services.

#### **References**



### JDart: Portfolio Solving, Breadth-First Search and SMT-Lib Strings (Competition Contribution)

Malte Mues (✉) and Falk Howar

TU Dortmund University, Dortmund, Germany {malte.mues, falk.howar}@tu-dortmund.de

Abstract. JDart performs dynamic symbolic execution of Java programs: it executes programs with concrete inputs while recording symbolic constraints on executed program paths. A portfolio of constraint solvers is then used for generating new concrete values from recorded constraints that drive execution along previously unexplored paths. For SV-COMP 2021, we improved JDart by implementing exploration strategies, bounded analysis, and path-specific constraint solving strategies, as well as by enabling the use of SMT-Lib string theory for encoding of string operations.

#### 1 Overview

JDart is a dynamic symbolic execution engine for the Java virtual machine (JVM) built on top of Java PathFinder (JPF) [12]. We first entered SV-COMP 2020 with JDart. Our corresponding report gives a short overview of JDart's architecture and internals [9]. In this paper, we focus on the description of the following three improvements that were explicitly motivated by SV-COMP 2021 [2].


While all three changes contribute to an improved performance of JDart, portfolio solving has by far the biggest impact on the number of analyzed benchmark instances of SV-COMP 2021. In this paper, we focus on the description of the changes for (1) and (2).

### 2 Tool Improvements for SV-COMP 2021

JDart runs as an extension of the JPF software model checker [12], using the Java virtual machine implemented by JPF and its capabilities for annotating values on the stack and the heap with symbolic information. The tool itself is written in Java and uses JConstraints [6] for encoding SMT problems. Moreover, JConstraints acts as a frontend to the Z3 [5] or CVC4 [1] SMT solver used for finding concrete values that drive the analysis.

© The Author(s) 2021

J. F. Groote and K. G. Larsen (Eds.): TACAS 2021, LNCS 12652, pp. 448–452, 2021. https://doi.org/10.1007/978-3-030-72013-1\_30

Fig. 1: The architecture and call hierarchy in the constraint solving backend.

Exploration Strategies. JDart has two main components: the *Executor* and the *Explorer*. While the *Executor* runs the concrete analysis and records symbolic constraints during concrete execution, the *Explorer* is responsible for exploration strategies and management of constraints. We re-designed the central data structure of the *Explorer*, the constraints tree, for SV-COMP 2021: The new tree supports different exploration strategies (e.g., breadth-first search) and bounds on the depth of exploration. In the past, JDart relied on unbounded depth-first exploration which would often 'get trapped' unrolling unbounded loops or recursion. Breadth-first search prevents this behavior and is more effective on the SV-COMP benchmark set.
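The effect of a depth-bounded breadth-first strategy can be illustrated on a toy decision tree. The sketch below is not JDart's constraints tree; `explore_bfs` and the `children` callback are hypothetical, and a node is simply the tuple of branch decisions taken so far.

```python
from collections import deque


def explore_bfs(root, children, max_depth):
    """Visit nodes level by level and never expand below max_depth."""
    order = []
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        order.append(node)
        if depth < max_depth:
            for child in children(node):
                queue.append((child, depth + 1))
    return order


def loop_children(path):
    """Toy model of an unbounded loop: as long as every decision so far
    was True (stay in the loop), two further outcomes are possible."""
    if all(path):
        return [path + (True,), path + (False,)]
    return []
```

An unbounded depth-first search on `loop_children` would follow the all-`True` branch forever, while `explore_bfs((), loop_children, 2)` visits the loop exits at every level and stops after two decisions.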

Portfolio-Solving. Figure 1 shows the architecture of the constraint solving backend used by JDart and JConstraints for SV-COMP; dashed components and control-flow have been added for SV-COMP 2021. The *bounding solver* (developed for SV-COMP 2020) calls subsequent solvers with successively weaker bounds on numeric variables. For SV-COMP 2021, we use upper bounds 2, 8, 13, 21, 200, 600, ∞ and symmetric negative lower bounds. The new *path-specific solver* selects the most promising solving approach for every concrete path constraint: currently, constraints involving string operations, type casts, or floating-point numbers are handed to the portfolio solver, as we expect better performance. The *portfolio solver* wraps the CVC4 solver: it restarts solving attempts in the case of (fairly frequent and random) segmentation faults, and it invokes Z3 after a fixed timeout of 60 seconds. All other path constraints are passed directly to the Z3 solver, as JDart used to do with all constraints at SV-COMP 2020.
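The idea of successively weaker bounds can be sketched as a simple loop around a solver. Everything below is an assumption made for illustration: `bounding_solve` is a hypothetical name, and `brute_force_solve` is a toy stand-in that enumerates small integer models instead of calling Z3 or CVC4. Only the sequence of bounds is taken from the text.

```python
import math
from itertools import product

BOUNDS = [2, 8, 13, 21, 200, 600, math.inf]  # upper bounds; lower bounds are symmetric


def bounding_solve(solve, constraint, variables, bounds=BOUNDS):
    """Try the solver with successively weaker bounds on all variables."""
    for bound in bounds:
        model = solve(constraint, variables, bound)
        if model is not None:
            return model, bound
    return None, None


def brute_force_solve(constraint, variables, bound):
    """Toy stand-in for an SMT solver: enumerates values in [-bound, bound]
    and gives up on the unbounded query."""
    if math.isinf(bound):
        return None
    values = range(-bound, bound + 1)
    for assignment in product(values, repeat=len(variables)):
        model = dict(zip(variables, assignment))
        if constraint(model):
            return model
    return None
```

For the constraint `x > 10 and x + y == 15`, the bounds 2 and 8 admit no model, so the first answer is found with bound 13. Small bounds tend to produce small concrete values, which is useful when the values are fed back into concrete execution.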

#### 3 Strengths and Weaknesses

JDart scored 623 points (max. of 693) in the Java track and was declared second winner for Java, after Java Ranger (630 points) [11]. Next best is JBMC [4] with 603 points. Like Java Ranger and JBMC, JDart did not report a single incorrect verdict. JDart exhibits the general strengths and weaknesses of dynamic and symbolic analysis approaches for Java programs:

Fast search for counterexamples. Driven by concrete execution, the analysis is fairly fast. In cases where it can provide an answer, JDart (950 s) is overall the second-fastest tool after JBMC (650 s). Notably, JDart successfully found counterexamples in 251 of 253 instances. The second-best tool in this respect is JBMC with 243 correct *false* verdicts. Of the two instances for which JDart did not produce counterexamples, one uses the split operation for strings that JDart does not yet model, leading to an *unknown* result. For the other instance, stack unrolling triggers an out-of-memory exception during the concolic execution of one path through the recursive Ackermann function.

Path Explosion. JDart is affected by path explosion in programs with long sequences of branching instructions with mutually unrelated conditions. Such sequences are common in code generated from models in the realm of embedded systems, e.g., by the *Alarm* benchmark instances in SV-COMP 2021. For these instances, JDart does not manage to explore all paths in the given time limit.

Unbounded Behavior. Based on principles of symbolic execution, JDart will only terminate on unbounded loops or in case of unbounded recursion when using manually configured bounds. In addition, the concolic execution might be configured to stop on property violations. As a consequence, assertion errors might be used as analysis bounds. For SV-COMP 2021, we used a search depth of 270 recorded decisions on paths in the constraints tree which we deemed conservative after initial experiments on the benchmark set: While in 13 instances *true* verdicts were given after exploring exhaustively up to the depth bound, there remain 30 problem instances for which JDart timed out exploring the search space up to the depth bound and 6 instances raising *unknown* verdicts (including the two mentioned above).

### 4 Tool Setup

The source code of JDart used for the competition artifact [8] is available on GitHub<sup>1</sup>. JDart is designed as a plug-in for JPF and relies on ant as a build system. One of its dependencies is the jpf-core project [12]. The other dependency is the JConstraints library, which was configured to use Z3 [5] and CVC4 [1] for SV-COMP 2021. For the competition, JDart is wrapped by the run-jdart.sh shell script, which generates .jpf configuration files specifying which benchmark to analyze and the global configuration options of JDart. For SV-COMP 2021, we chose termination on the first assertion error, a depth bound of 270 (decisions on paths in the constraints tree) for exploration, breadth-first search as the exploration strategy, and the path-specific solver together with iterative weakening of bounds on values in models, as described in Section 2. Z3 is configured to run with the sequence solver for strings. The shell script records and interprets the output of JDart and can also report the version of JDart.

<sup>1</sup> https://github.com/tudo-aqua/jdart, Commit 4a9cc43

#### 5 Software Project

JDart, as used in SV-COMP 2021, is maintained by the Automated Quality Assurance Group at TU Dortmund University (in particular by the authors of this paper) and is available under the Apache License, version 2.0, on GitHub<sup>1</sup>. An initial version of JDart was developed by the authors of [7] at NASA Ames Research Center and Carnegie Mellon University. The original version of JDart is available on GitHub<sup>2</sup>.

#### References


<sup>2</sup> https://github.com/psycopaths/jdart



### **Symbiotic 8: Beyond Symbolic Execution<sup>∗</sup> (Competition Contribution)**

Marek Chalupa<sup>1</sup>, Tomáš Jašek<sup>1</sup>, Jakub Novák<sup>1</sup>, Anna Řechtáčková<sup>1</sup>, Veronika Šoková<sup>2</sup>, and Jan Strejček<sup>1</sup>

<sup>1</sup> Masaryk University, Brno, Czech Republic <sup>2</sup> Brno University of Technology, FIT, Brno, Czech Republic

**Abstract.** Symbiotic 8 extends the traditional combination of static analyses, instrumentation, program slicing, and symbolic execution with one substantial novelty, namely a technique mixing symbolic execution with k-induction. This technique can prove the correctness of programs with possibly unbounded loops, which cannot be done by classic symbolic execution. Symbiotic 8 also delivers several other improvements. In particular, we have modified our fork of the symbolic executor Klee to support the comparison of symbolic pointers. Further, we have tuned the shape analysis tool Predator (integrated already in Symbiotic 7) to perform better on llvm bitcode. We have also developed a light-weight analysis of relations between variables that can prove the absence of out-of-bound accesses to arrays.

#### **1 Verification Approach**

Symbiotic is a program analysis framework that combines fast static analyses with code instrumentation and program slicing to speed up the code verification, which is then performed by the symbolic executor Klee [3] (or, alternatively, by another supported verification tool). The main improvement in Symbiotic 8 is a new verification technique combining symbolic execution with k-induction [8] that we call KindSE.

**Symbolic execution with k-induction (KindSE)** KindSE applies the idea of k-induction [8] to paths of the control flow graph. The approach can be roughly described by the following three steps.


<sup>∗</sup> This work has been supported by the Czech Science Foundation grant GA20-07487S.

Jury member and the corresponding author: chalupa@fi.muni.cz.

© The Author(s) 2021

J. F. Groote and K. G. Larsen (Eds.): TACAS 2021, LNCS 12652, pp. 453–457, 2021. https://doi.org/10.1007/978-3-030-72013-1\_31

3. If P is empty, the control flow graph contains no feasible path of length k (or more) leading to an error location and thus we report that the program is correct. If P is not empty, we replace each path π ∈ P by paths of length k + 1 that have π as its suffix, increase k by one, and go to step 2.
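The path extension in step 3 can be sketched on an explicit control flow graph. The sketch covers only the purely graph-based part of the step; feasibility checks with a symbolic executor are not modeled. The function name and the representation of the graph as a predecessor map are assumptions made for illustration.

```python
def extend_paths(paths, predecessors):
    """Replace each path by all paths that are one edge longer and have
    the original path as a suffix. Paths are tuples of locations."""
    extended = []
    for path in paths:
        for pred in predecessors.get(path[0], []):
            extended.append((pred,) + path)
    return extended


# Toy control flow graph: entry -> loop, loop -> body, body -> loop, loop -> err
PREDECESSORS = {
    "loop": ["entry", "body"],
    "body": ["loop"],
    "err": ["loop"],
}
```

Starting from the single path `("loop", "err")` with k = 1, one extension yields `("entry", "loop", "err")` and `("body", "loop", "err")`. A path that already starts in a location without predecessors, such as `entry`, has no extension and drops out of the set.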

To improve the performance, we further extended the algorithm to summarize loop iterations. If we process a program location that is a loop header, we start unwinding the loop backwards. We over-approximate the states that we get in every loop iteration to cover more than one iteration if possible. If we are successful, the summarized loop states form an inductive invariant, which can help to prove that no error location is reachable from the loop header in k steps. Our loop summarization handles neither nested loops (in this case we fall back to the algorithm without loop summarization) nor function calls. To fix the latter restriction, we inline all procedures (if possible) before running KindSE.

KindSE is implemented in our prototype tool Slowbeast [1], which we integrated into Symbiotic 8. The tool currently supports only the unreach-call property. Slowbeast can also work as a standard symbolic executor (without k-induction), but it is noticeably slower than Klee and it has some limitations. However, it supports symbolic floating-point arithmetic, which Klee does not.

**Workflow of Symbiotic 8** As the first step, a given program is translated to llvm [6]. If the program contains a call to pthread_create, Symbiotic returns unknown as it cannot handle parallel programs. The rest of the workflow then depends on the verified property, as indicated in Figure 1.

For the unreach-call property, we call the slicer to remove instructions that have no influence on the property and run Klee. If Klee does not decide in 222 seconds, we run KindSE in Slowbeast. If it fails, we run Klee again and if it also fails, we run Slowbeast as a standard symbolic executor. If some tool says that the specified call is unreachable, we return true with the trivial witness. If we detect that the specified call is reachable, we try replaying the error path on the unsliced program. If the replay confirms that the call is reachable, we return false with the error witness generated from the replay.

**Fig. 1.** The workflow of Symbiotic 8

For other properties, we instrument the program with the help of various analyses. For example, when checking memory safety, we use Predator [5], DG [4], and a values-relations analysis to detect potentially unsafe instructions. If Predator says that all instructions are safe, we directly return true. Otherwise, we slice the program with respect to potentially unsafe instructions and call Klee. The rest of the process is identical to the previous case.

#### **2 Software Architecture**

All components of Symbiotic 8 use llvm 10 [6]. Scripts that call and control the components according to a given configuration are written in Python.

The instrumentation module is written in C++. In Symbiotic 8, we have newly integrated a values-relations analysis as a plugin into instrumentation. This analysis is able to prove the validity of some array accesses. We have also improved the llvm frontend of Predator [5] so that it performs similarly well as the gcc frontend.

The program slicing module is written in C++ and is built around the library DG [4]. This year, we sped up the slicer by using more efficient data structures in pointer analysis and by using function summaries in data dependence analysis.

We use our own fork of Klee [3] that differs from the upstream Klee mainly in using a segment-offset pointer representation, which allows for better handling of symbolic pointers and symbolic-sized allocations. This year, we improved the handling of symbolic pointers and added support for comparison of symbolic addresses.

The tool Slowbeast [1] is written in Python. Both Klee and Slowbeast use Z3 [7] as the SMT solver.

#### **3 Strengths and Weaknesses**

Symbolic execution may be very efficient in finding bugs, but it suffers from the path explosion problem, which may prevent it from fully analyzing programs with a high level of branching. We alleviate this problem by using program slicing. However, in the presence of unbounded loops or infinite execution paths, program slicing does not help unless it removes the unbounded computation from the program. Indeed, classical symbolic execution is unable to verify such programs at all.

To address the inability of symbolic execution to verify unbounded programs, we use KindSE. However, its implementation in Slowbeast is not yet mature and it handles only a very restricted set of programs.

**Results of Symbiotic 8 in SV-COMP 2021** Symbiotic 8 won the MemSafety and SoftwareSystems categories [2]. In the MemSafety category, we lost many points in the new MemSafety-Juliet subcategory. These benchmarks contain threads and Symbiotic immediately answered unknown due to the syntactic check mentioned in Section 1. However, most of these benchmarks actually do not spawn any thread and thus Symbiotic could analyze them. The victory in the SoftwareSystems category is mainly due to the dominance on the new uthash benchmarks.

This year, over 500 correct answers produced by Symbiotic were not confirmed. Some of these cases can be attributed to the fact that Symbiotic generates only trivial correctness witnesses. However, there are also answers left unconfirmed because of missing witnesses, which turned out to be caused by a bug in the Slowbeast integration. Unfortunately, these include all 99 benchmarks that were newly proved correct by KindSE, of which 85 were in the ReachSafety-Loops subcategory. We also had many unconfirmed witnesses for non-termination violations that still need to be investigated.

Symbiotic had 16 incorrect answers: 14 incorrect true in the Termination category and 2 incorrect false in ReachSafety-Floats. All of them were caused by last-minute commits that were fixed shortly after the submission deadline. Because of these mistakes, Symbiotic ended up in 4th place instead of 2nd in the Termination category.

In the Overall meta-category, Symbiotic took 4th place, as it has every year since 2018.

#### **4 Tool Setup and Configuration**

The archive is available at https://doi.org/10.5281/zenodo.4483882. Run Symbiotic as:

bin/symbiotic --sv-comp --prp <prpfile> [--32] <source>

The option --prp sets the verified property and --32 tells Symbiotic to assume 32-bit architecture (64-bit architecture is assumed by default).
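When Symbiotic is driven from a script, the command line above can be assembled programmatically. The following is a minimal sketch that uses only the options named in the text; the helper name and the file names `unreach-call.prp` and `test.c` are hypothetical.

```python
import shlex
from typing import List


def symbiotic_command(prp_file: str, source: str, use_32bit: bool = False) -> List[str]:
    """Build the argument list for the invocation shown in the text."""
    cmd = ["bin/symbiotic", "--sv-comp", "--prp", prp_file]
    if use_32bit:
        cmd.append("--32")  # otherwise a 64-bit architecture is assumed
    cmd.append(source)
    return cmd


# Hypothetical file names, for illustration only.
print(shlex.join(symbiotic_command("unreach-call.prp", "test.c", use_32bit=True)))
```

The resulting list can be passed directly to `subprocess.run` without going through a shell.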

#### **5 Software Project and Contributors**

Symbiotic 8 for SV-COMP 2021 has been developed by Marek Chalupa, Tomáš Jašek, Jan Novák, and Anna Řechtáčková under the supervision of Jan Strejček. Veronika Šoková provided valuable help with adjusting Predator modifications. Symbiotic is available under the MIT license. All the external components that the tool uses are also available under open-source licenses that comply with SV-COMP's policy for the reproduction of results. The source code of Symbiotic can be found at:

https://github.com/staticafi/symbiotic

#### **References**



### **VeriAbs: A Tool for Scalable Verification by Abstraction (Competition Contribution)**

Priyanka Darke (-), Sakshi Agrawal, and R Venkatesh

TCS Research, Pune, India {priyanka.darke, agrawal.sakshi4, r.venky}@tcs.com

**Abstract.** VeriAbs is a strategy selection-based reachability verifier for C programs. A suitable strategy is selected from a pre-defined set of strategies by taking into account the syntax and semantics of the code to be verified. This year we present VeriAbs version 1.4.1, which introduces a novel preprocessor to strategy selection. The preprocessor checks the feasibility of performing a lightweight slicing of the input code using function call graph and variable reference information. If the program is found to be sliceable, sub-programs or slices are generated, and the known strategy selection algorithm of VeriAbs is applied to each slice. The verification results of the slices are then composed to derive that of the entire program. This compositional verification has improved the scalability of VeriAbs and is presented in this paper.

#### **1 Verification Approach**

VeriAbs is a C program verifier using a portfolio of twelve verification techniques [2]. These techniques are organized into four strategies as shown in Figure 1. Each strategy is defined such that it benefits the verification of a specific type of program. A program type is identified by a strategy selector based on the following code-structural and variable-data properties: (1) unstructured control flow, (2) loops with arrays, (3) short input ranges, and (4) numerical loops in code. The strategy selector looks for these properties in the given order and assigns a verification strategy to the code. For this it uses code-structure and interval analyses [2]. If the assigned strategy is unable to verify the program, it exits unless the program contains arrays. In that case it selects the default strategy corresponding to numerical loops. See [2,3] for details on each verification technique implemented in VeriAbs.
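The ordered selection with its array fallback can be sketched as follows. This is a hedged illustration, not the VeriAbs implementation: the property identifiers, the `contains_arrays` flag, and the `run_strategy` callback are hypothetical, and falling through to the numerical-loops strategy when no property matches is an assumption.

```python
from typing import Callable, Dict

# Property names follow the order given in the text; the identifiers are made up.
ORDERED_PROPERTIES = [
    "unstructured_control_flow",
    "loops_with_arrays",
    "short_input_ranges",
    "numerical_loops",
]
DEFAULT_STRATEGY = "numerical_loops"


def select_strategy(properties: Dict[str, bool]) -> str:
    """Return the strategy for the first property that holds."""
    for name in ORDERED_PROPERTIES:
        if properties.get(name, False):
            return name
    # Assumption: fall through to the default strategy when nothing matches.
    return DEFAULT_STRATEGY


def verify(properties: Dict[str, bool], run_strategy: Callable[[str], str]) -> str:
    """Run the selected strategy, retrying with the default one for array programs."""
    chosen = select_strategy(properties)
    result = run_strategy(chosen)
    if result != "unknown":
        return result
    # Retry with the default strategy only if the program contains arrays.
    if properties.get("contains_arrays", False) and chosen != DEFAULT_STRATEGY:
        return run_strategy(DEFAULT_STRATEGY)
    return "unknown"
```

Because the properties are checked in a fixed order, a program with both array loops and numerical loops is first handled by the array strategy and reaches the numerical-loops strategy only through the fallback.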

The colored blocks in Figure 1 indicate the enhancements made to the tool this year and are explained next. A colored block with a dashed outline indicates a component that has been added to VeriAbs for the first time, and one with a solid outline indicates a block that existed in older versions and has been modified. The dashed arrows indicate information flow added this year. This information is the verification result of the respective strategy passed back to the slicer-analyzer explained in the next section. Besides these, there are changes in the witness generation strategies, which are also explained in the next section.

<sup>-</sup>Jury member

© The Author(s) 2021

J. F. Groote and K. G. Larsen (Eds.): TACAS 2021, LNCS 12652, pp. 458–462, 2021. https://doi.org/10.1007/978-3-030-72013-1_32

**Fig. 1.** VeriAbs Architecture (**S**: Program Safe, **F**: Property Fails, **U**: Unknown)

#### **1.1 Tool Enhancements**

Slicer-Analyzer. It has the following responsibilities: (1) checking the sliceability of the input program $P$, (2) generating slices $P_1, P_2, \ldots, P_r$ if $P$ is sliceable, and (3) computing the verification result $R$ of $P$. Accordingly, the slicer-analyzer comprises three parts. The first part checks for sliceability. Let main be the entry function of $P$. We define $P$ to be sliceable with respect to main if all distinct functions $f_1, f_2, \ldots, f_r$ directly called from main are defined in $P$ and are independent of each other. We define the functions called from main to be independent iff main is nonrecursive and contains no loops or unstructured control flow [2]; there is no transitive dependence (made up of control and data dependence) between calls to $f_1, \ldots, f_r$ in main; no two functions in $f_1, f_2, \ldots, f_r$ transitively call the same function; and if $F(f_i)$ is the union of $f_i$ and the functions transitively called by $f_i$, then no two sets in $F(f_1), F(f_2), \ldots, F(f_r)$ refer to the same global variable in the program. That is, if $V(F(f_i))$ is the set of global variables referred to by functions in $F(f_i)$, then $\forall m, n \mid 1 \le m \le r,\ 1 \le n \le r,\ m \ne n \implies V(F(f_m)) \cap V(F(f_n)) = \emptyset$. The call graph and referred-variable information is computed using call-trees and a lightweight flow-insensitive pointer analysis.
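Two of the independence conditions, namely that no two callees transitively call the same function and that the sets of referenced global variables are pairwise disjoint, can be sketched on an explicit call graph. This is a minimal illustration under stated assumptions: the dictionaries `calls` and `refs` and both helper names are hypothetical, and the remaining conditions (main being nonrecursive and loop-free, and no dependence between the calls in main) are not checked here.

```python
from itertools import combinations
from typing import Dict, List, Set


def reachable_functions(root: str, calls: Dict[str, Set[str]]) -> Set[str]:
    """F(root): root together with every function it transitively calls."""
    seen: Set[str] = set()
    stack = [root]
    while stack:
        func = stack.pop()
        if func not in seen:
            seen.add(func)
            stack.extend(calls.get(func, set()))
    return seen


def callees_independent(
    callees: List[str],
    calls: Dict[str, Set[str]],
    refs: Dict[str, Set[str]],
) -> bool:
    """Check pairwise disjointness of F(f_i) and of V(F(f_i))."""
    closure = {f: reachable_functions(f, calls) for f in callees}
    used = {f: set().union(*(refs.get(g, set()) for g in closure[f])) for f in callees}
    for a, b in combinations(callees, 2):
        if closure[a] & closure[b]:
            return False  # both transitively call the same function
        if used[a] & used[b]:
            return False  # both refer to the same global variable
    return True


# Call graph and global references of the program in Fig. 2.
calls = {"main": {"f1", "f2"}, "f1": set(), "f2": set()}
refs = {"f1": {"c"}, "f2": {"b"}}
print(callees_independent(["f1", "f2"], calls, refs))
```

For the program in Fig. 2, f1 refers only to c and f2 only to b, so this part of the check succeeds.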

void main () { b=30,c=10; if(!a) f1(); else if(b) f2(); ... } f1(){c++;} f2(){b=0; assert(b);}

**Fig. 2.** Input Code

If the above conditions are satisfied, then, using concepts presented in [10], the body of main is sliced with respect to the call(s) to $f_i$ to create the entry function $\mathit{main}_i$ of the executable slice $P_i$. Since main is sliced with respect to calls to $f_i$, $P_i$ will only have the functions in $F(f_i)$ and $\mathit{main}_i$. That is, the set of functions in slice $P_i$ is given by $F(f_i) \cup \mathit{main}_i$. In this way, the set of all slices is generated by the second part of the slicer-analyzer.

The proposed technique of slicing has the potential to greatly reduce the state space of the input program. This hypothesis is supported by experimental results presented later. The proposed slicing function uses control- and data-flow information local to main, hence it is lightweight.

Consider the example in Figure 2. One slice from this code is given in Figure 3. As seen in Figure 3, function main has been sliced with respect to the call to f2, which contains the error. Function f1 need not be analyzed to find the error. This type of slicing is helpful in analyzing large code in which the verifier may run out of resources while analyzing an irrelevant function like f1.

void main () { b=30,c=10; if(!a) ; else if(b) f2(); ... } f2(){b=0; assert(b);}

**Fig. 3.** One Slice

Next, VeriAbs applies its strategy selection to each slice $P_i$, $1 \le i \le r$, sequentially. The results of the slices are composed to compute $R$, the verification result of $P$, by the third part of the slicer-analyzer as follows: if an error trace is realized for any slice, then $R$ is set to failure; if all slices are proved to be safe, then $R$ is set to safe; otherwise, if none of the slices is found to be erroneous and there exists a slice that could not be verified, then $R$ is set to unknown.
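The composition rule is small enough to state directly in code. The sketch below assumes that each slice result is one of the strings `"safe"`, `"failure"`, or `"unknown"` and that at least one slice exists; the function name is hypothetical.

```python
from typing import Iterable


def compose_results(slice_results: Iterable[str]) -> str:
    """Combine per-slice verdicts into the verdict R for the whole program."""
    results = list(slice_results)
    if any(r == "failure" for r in results):
        return "failure"  # an error trace in any slice is an error in P
    if all(r == "safe" for r in results):
        return "safe"     # every slice was proved safe
    return "unknown"      # no error found, but some slice was not verified


print(compose_results(["safe", "unknown", "failure"]))
print(compose_results(["safe", "safe"]))
print(compose_results(["safe", "unknown"]))
```

Note that a failure takes precedence over an unverified slice, so one error trace is enough to decide the program even if other slices time out.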

This idea of slicing based on function call and variable reference information has been proposed for the first time. It is similar to a concept of clustering presented in [12]. Both these techniques partition a given application into independently executable slices. But [12] forms clusters with respect to un-called functions in the code base. The proposed sliceability criterion on the other hand focuses only on functions called from a given (entry) function main. It uses control- and data-flow analyses local only to the given function to slice it with respect to calls in its body. This in turn removes all functions not called from main. Another technique generates multiple backward slices at every calling context with respect to a property to be verified [8]. The proposed slicing technique however produces slices with respect to functions defined in P and called from main.

Witness Generation From Slices: VeriAbs stores slices in the form of separate C programs. To generate a valid witness from a slice it is critical to report the correct line numbers in the witness [5]. The slicer-analyzer maintains correct line numbers in the slice with respect to the original code by adding #line directives to it. The directives are added at every point in the slice which reads values from the environment, starts a block of code, or contains a branching condition. The witness generated from such a slice in VeriAbs is valid with respect to the original program.
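The effect of `#line` directives can be illustrated with a small generator. Note that this sketch uses a simpler placement rule than the one described above: it emits a directive wherever the original line numbering of the kept lines is not consecutive, rather than at environment reads, block starts, and branching conditions. The function name and the sample line numbers are hypothetical.

```python
from typing import List, Tuple


def emit_with_line_directives(kept: List[Tuple[int, str]]) -> str:
    """Render the kept (original_line_number, text) pairs as a slice.

    A '#line N' directive makes the C preprocessor number the following
    source line as N, so diagnostics refer to the original program.
    """
    out: List[str] = []
    expected = None
    for lineno, text in kept:
        if lineno != expected:
            out.append(f"#line {lineno}")  # numbering jumped: resynchronize
        out.append(text)
        expected = lineno + 1
    return "\n".join(out)


# Hypothetical slice that drops original lines 3 and 4.
kept = [
    (1, "void main() {"),
    (2, "  b = 30, c = 10;"),
    (5, "  if (b) f2();"),
    (6, "}"),
]
print(emit_with_line_directives(kept))
```

In this example a directive is emitted before the first line and again before original line 5, where the removed lines would otherwise shift the numbering.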

Experimental results: The proposed slicing led to VeriAbs successfully analyzing 120 additional programs in ReachSafety in SV-COMP'21. On the other hand it runs out of time while verifying eighteen programs that it could successfully analyze earlier. This is due to the additional time required to slice. Overall these values demonstrate the feasibility of this approach.

Next we present modifications made to existing components of VeriAbs.

Strategy 1: Unstructured Control Flow. The first strategy, meant for programs with unstructured control flow, thus far executed two verification techniques in parallel. The two techniques were an evolutionary test generation algorithm using grey-box fuzzing [13] and k-induction with continuously refined invariants [6]. This year we do not use the first algorithm in strategy 1, because the time it takes to generate useful error traces is very large. We observe that as the program complexity increases with the number of constraints, branching conditions, and/or non-determinism, so does the time the test evolution algorithm needs to reach the error. As a result, the algorithm offers no apparent advantage when applied in parallel with k-induction. We present our experimental observations of the given algorithm in [2]. On the other hand, not using this algorithm led to time savings and the verification of a few additional programs. We continue to use this algorithm for non-reactive loops and for programs with inputs of short ranges (strategy 3) [2]. Here we allocate it an independent thread with no time limits, while results are obtained quickly for non-reactive loops.

Witness Generation. This year VeriAbs uses the same strategies as last year to generate violation witnesses [3]. For correctness witnesses VeriAbs derives invariants from the over-approximation techniques in its portfolio. To save time this year VeriAbs does not extract invariants from k-induction [6] and interpolation [11] to generate correctness witnesses. From amongst the impacted witnesses, this led to 12 fewer witnesses being validated than last year.

#### **2 Software Architecture**

VeriAbs uses Vajra to perform full program induction [7], American Fuzzy Lop [13] to perform test evolution with fuzzing, and CPAchecker v1.8 [6] in the first strategy for k-induction. For bounded model checking VeriAbs uses the C Bounded Model Checker (CBMC) v5.10 [9] with the Glucose Syrup SAT solver v4.0 [4]. All remaining program analyses are implemented in the TCS Research group's program analysis framework called Prism [12]. The slicer-analyzer and the strategy selector are partly implemented in Perl.

#### **3 Strengths and Weaknesses**

The main strengths of VeriAbs lie in its (1) portfolio of sound verification techniques, and its ability to (2) perform a lightweight slicing, (3) classify programs based on structural and variable data properties of code, and (4) match these code properties with suitable verification techniques. The main weakness of VeriAbs lies in its lack of an integrated implementation of witness generation that can utilize invariants derived across all strategies or techniques. This is because the invariants are to be derived from various abstractions, some of which are generated by off-the-shelf tools, and not yet extracted.

#### **4 Tool Setup and Configuration**

The VeriAbs SV-COMP 2021 executable is available for download at https://gitlab.com/sosy-lab/sv-comp/archives-2021/-/tree/master/2021/veriabs.zip. To install the tool, download the archive, extract its contents, and then follow the installation instructions in VeriAbs/INSTALL.txt. To execute VeriAbs, the user needs to specify the property file using the --property-file option. The witness is generated in the current working directory as witness.graphml. VeriAbs participated in the ReachSafety category of SV-COMP 2021. The BenchExec wrapper script for the tool is veriabs.py and the benchmark description file is veriabs.xml. A sample command is as follows:

VeriAbs/scripts/veriabs --property-file reach-safety.prp a.c

#### **5 Software Project and Contributors**

A few members of the Foundations of Computing group at TCS Research [1] maintain VeriAbs. They can be contacted at veriabs.tool@tcs.com. We thank the past developers of VeriAbs and the creators of Prism [12], Vajra, CPAchecker, and CBMC. We especially thank Bharti Chimdyalwar, Shrawan Kumar, and Ulka Shrotri for their insightful reviews.

#### **References**



### Author Index

Abate, Alessandro I-370 Abbasi, Rosa II-242 Ádám, Zsófia II-433 Agrawal, Sakshi II-458 Ahmed, Daniele I-370 Ahrendt, Wolfgang II-242 Alur, Rajeev I-430 Amir, Guy II-203 André, Étienne I-311 Andrianov, Pavel II-423 Andriushchenko, Roman I-191 Apinis, Kalmer II-438 Arias, Jaime I-311 Ashok, Pranav II-326

Backenköhler, Michael I-210 Baek, Seulkee I-59 Bansal, Suguman I-20 Barrett, Clark I-113, II-145, II-203 Bendík, Jaroslav I-291 Beneš, Nikola II-64 Beyer, Dirk II-401 Biere, Armin I-133, II-357 Biewer, Sebastian II-365 Bisping, Benjamin I-3 Blondin, Michael II-3 Bonakdarpour, Borzoo I-94 Bortolussi, Luca I-210 Brim, Luboš II-64 Bryant, Randal E. I-76 Budde, Carlos E. II-373

Carneiro, Mario I-59 Černá, Ivana I-291 Češka, Milan I-191 Chalupa, Marek II-453 Chatterjee, Krishnendu I-20 Chattopadhyay, Agnishom I-330 Chen, Ran II-262 Christakis, Maria II-43 Cohen, Aviad II-87

Darke, Priyanka II-458 Darulova, Eva II-43, II-242

Erhard, Julian II-438 Ernst, Gidon II-24

Fedyukovich, Grigory II-24 Felgenhauer, Bertram II-127 Ferreira, Margarida I-152 Finkbeiner, Bernd II-365 Furuse, Jun II-262

Ganesh, Vijay II-303 Gieseking, Manuel II-381 Giesl, Jürgen I-250 Gol, Ebru Aydin I-291 Gorostiaga, Felipe II-349 Griggio, Alberto I-113 Großmann, Gerrit I-210

Haas, Thomas II-428 Haase, Christoph II-3 Hajdu, Ákos II-433 Hark, Marcel I-250 Hartmanns, Arnd II-373 Hausmann, Daniel I-38 Hecking-Harbusch, Jesko II-381 Hermanns, Holger II-365, II-389 Heule, Marijn J. H. I-59, I-76, II-223 Hojjat, Hossein II-443 Howar, Falk II-448 Hsu, Tzu-Han I-94 Huang, Cheng-Chao I-389

Igarashi, Atsushi II-262 Irfan, Ahmed I-113

Jackermeier, Mathias II-326 Jašek, Tomáš II-453 Jeangoudoux, Clothilde II-43 Junges, Sebastian I-173, I-191

Katoen, Joost-Pieter I-173, I-191, I-230 Katz, Guy II-203 Kaufmann, Daniela II-357 Kawata, Akira II-262 Khoroshilov, Alexey II-423 Klauck, Michaela II-389 Köhl, Maximilian A. II-365, II-389 Křetínský, Jan II-326

Lam, Wing I-270 Lepiller, Julien II-105 Li, Jianlin I-389 Li, Renjue I-389 Li, Yahui I-430 Lochmann, Alexander II-127 Lohar, Debasmita II-43 Loo, Boon Thau I-430 Lynce, Inês I-152

Majumdar, Rupak I-449 Mamouras, Konstantinos I-330 Mann, Makai I-113 Marinov, Darko I-270 Martins, Ruben I-152 Meyer, Fabian I-250 Meyer, Roland II-428 Middeldorp, Aart II-127 Mitterwallner, Fabian II-127 Mues, Malte II-448 Mutilin, Vadim II-423 Myreen, Magnus O. II-223

Nadel, Alexander II-87 Nejati, Saeed II-303 Nestmann, Uwe I-3 Niemetz, Aina II-145, II-303 Nishida, Yuki II-262 Novák, Jakub II-453

Offtermatt, Philip II-3 Osama, Muhammad I-133

Padon, Oded I-113 Pastva, Samuel II-64 Peruffo, Andrea I-370 Petrucci, Laure I-311 Piskac, Ruzica II-105 Platzer, André II-181 Pol, Jaco van de I-311 Ponce-de-León, Hernán II-428 Preiner, Mathias II-145, II-303

Quatmann, Tim I-230

Řechtáčková, Anna II-453 Reger, Giles II-164 Reynolds, Andrew II-145 Rümmer, Philipp II-443 Ryvchin, Vadim II-87

Saan, Simmo II-438 Šafránek, David II-64 Saito, Hiromasa II-262 Sallai, Gyula II-433 Sánchez, César I-94, II-349 Santolucito, Mark II-105 Schäf, Martin II-105 Schiffl, Jonas II-242 Schmid, Stefan I-411 Schnepf, Nicolas I-411 Schnitzer, Yannik II-365 Schoisswohl, Johannes II-164 Schröder, Lutz I-38 Schwarz, Michael II-438 Schwenger, Maximilian II-365 Scott, Joseph II-303 Seidl, Helmut II-438 Sencan, Ahmet I-291 Shamakhi, Ali II-443 Shi, Lei I-430 Sobel, Joshua II-43 Šoková, Veronika II-453 Sotoudeh, Matthew II-281 Spel, Jip I-173 Srba, Jiří I-411 Strejček, Jan II-453 Suenaga, Kohei II-262 Sun, Jun I-389

Tan, Yong Kiam II-181, II-223 Terra-Neves, Miguel I-152 Thakur, Aditya V. II-281 Thinniyam, Ramanathan S. I-449 Tinelli, Cesare II-145

Ulbrich, Mattias II-242

Vardi, Moshe Y. I-20 Venkatesh, R. II-458 Ventura, Miguel I-152 Vogler, Ralf II-438 Vojdani, Vesal II-438 Voronkov, Andrei II-164

Wang, Jingyi I-389 Wang, Zhifu I-330 Wei, Anjiang I-270 Weinhuber, Christoph II-326 Weininger, Maximilian II-326 Weiss, Gail I-351 Wijs, Anton I-133 Wolf, Verena I-210 Wu, Haoze II-203

Xie, Tao I-270 Xue, Bai I-389

Yadav, Mayank II-326 Yang, Pengfei I-389 Yanich, Ann II-381 Yellin, Daniel M. I-351 Yi, Pu I-270

Zetzsche, Georg I-449 Zhang, Lijun I-389